Abstract
With the widespread use of news websites, blogs, social networks, and question-answering systems, web-based content generation is increasing. The content generated by these websites is often very similar, which is why near-duplicate pages exist on the Internet. The appearance of such near-duplicate pages in search engine results reduces the efficiency of these systems. Therefore, a mechanism that can detect near-duplicate documents is necessary. Identifying near duplicates among a massive number of documents is a time-consuming and expensive task. The goal of this thesis is to propose a method that detects near-duplicate documents rapidly and accurately.
The proposed method operates on two similarity functions: a low-cost one and an expensive one. First, based on the low-cost function, the documents are partitioned into groups such that the probability that documents in different groups are near duplicates is very low. Because the low-cost similarity measure is cheap to compute, the massive collection of documents is partitioned into small parts quickly. The expensive similarity measure is then applied to the documents of each part separately. Because each partition is small, identifying near-duplicate documents within each part is also fast. Words that appear in only one document (unique words) and words that occur in a small number of documents (less-frequent words) serve as the low-cost and expensive similarity measures, respectively. This idea contrasts with previous methods, which used frequent words as their similarity measures.
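As a rough illustration of this two-phase idea (not the thesis's actual algorithm), the Python sketch below buckets documents that share at least one less-frequent word, then runs a pairwise comparison only within each bucket. The function name, the df_cutoff and threshold parameters, and the use of Jaccard similarity over rare-word sets are all assumptions for the sake of the example; the abstract does not specify these details.

```python
from collections import Counter, defaultdict
from itertools import combinations

def near_duplicates(docs, df_cutoff=3, threshold=0.8):
    """Two-phase near-duplicate detection sketch.

    docs: dict mapping a document id to its set of tokens.
    df_cutoff and threshold are illustrative knobs, not values
    taken from the thesis.
    """
    # Document frequency of every word across the corpus.
    df = Counter()
    for words in docs.values():
        df.update(words)

    # Keep only "less-frequent" words: those appearing in at most
    # df_cutoff documents (words with df == 1 are the "unique" words).
    rare = {d: {w for w in ws if df[w] <= df_cutoff}
            for d, ws in docs.items()}

    # Phase 1 (low-cost): bucket documents that share a rare word.
    # Documents that never land in a common bucket are assumed
    # unlikely to be near duplicates and are never compared.
    buckets = defaultdict(set)
    for d, ws in rare.items():
        for w in ws:
            buckets[w].add(d)

    # Phase 2 (expensive): within each bucket, compare rare-word
    # sets pairwise with Jaccard similarity.
    pairs = set()
    for members in buckets.values():
        for a, b in combinations(sorted(members), 2):
            inter = len(rare[a] & rare[b])
            union = len(rare[a] | rare[b])
            if union and inter / union >= threshold:
                pairs.add((a, b))
    return pairs
```

In this sketch the expensive pairwise comparison is performed only inside buckets, which mirrors the abstract's claim that cheap partitioning keeps each part small and makes the costly second phase fast.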
Evaluation of the proposed method on two datasets showed that less-frequent words are rich features for detecting near-duplicate documents. In addition, since these words occur in the corpus far less often than frequent words, the required memory is very low. The results also showed that the document partitioning function significantly increases the speed of the proposed method without degrading its accuracy.
Keywords: near-duplicate documents, less-frequent words, unique words, low-cost and expensive similarity measures, partitioning