Abstract
Key phrase extraction is at the core of the automated processing of text data, and is therefore very important in text mining algorithms. It is a fundamental step in most text mining projects, so research into the best way to extract key phrases from text with good accuracy and execution time is of particular importance. Key phrases are used for categorizing, clustering, indexing, searching, summarizing, measuring the semantic similarity of textual documents, and in almost all other areas of text mining.
In this research, a new algorithm is proposed that, in addition to extracting key phrases at high speed, is more accurate than other algorithms in this field. To remove stopwords, an optimized stopword list is presented that increases the accuracy and speed of removing stopwords from the input text. In addition, a linguistic approach is used to extract proper key phrases. This approach uses both syntactic and lexical methods to identify suitable candidate phrases for processing and key phrase extraction.
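As a rough illustration of the lexical side of candidate identification, the following minimal Python sketch splits text into candidate phrases at stopwords and punctuation, in the general style of RAKE. The stopword set, the function name `candidate_phrases`, and the tokenization rule are illustrative assumptions, not the list or method developed in this research, and the syntactic (part-of-speech) filtering is not shown.

```python
import re

# Illustrative stopword set only (an assumption); the research presents its own list.
STOPWORDS = {"the", "of", "a", "an", "is", "in", "and", "for", "on", "that", "to"}


def candidate_phrases(text, stopwords=STOPWORDS):
    """Split text into candidate phrases at punctuation and stopwords."""
    phrases = []
    # Punctuation ends a phrase, so process each fragment separately.
    for fragment in re.split(r"[.,;:!?()\[\]\"]+", text.lower()):
        current = []
        for word in re.findall(r"[a-z0-9][a-z0-9\-]*", fragment):
            if word in stopwords:
                # A stopword closes the phrase collected so far.
                if current:
                    phrases.append(" ".join(current))
                    current = []
            else:
                current.append(word)
        if current:
            phrases.append(" ".join(current))
    return phrases


print(candidate_phrases("Extracting key phrases is the core of text mining."))
```

In this sketch the quality of the candidates depends directly on the stopword list, which is one reason a well-chosen list can affect both accuracy and speed.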
Besides the optimizations mentioned above, other optimizations are also applied to all the algorithms considered in this research, including the TF-IDF and RAKE algorithms, and two new algorithms, TFIDF-1-TEXT and optimized-RAKE, are presented. Using four criteria, namely precision, recall, F-score, and the Jaccard similarity coefficient (JSC), it is shown that these two algorithms provide better results than the other algorithms. The experiments show that the TFIDF-1-TEXT algorithm is better than the TF-IDF algorithm and some other algorithms. It is also shown that the optimized-RAKE algorithm is better than all the other algorithms considered in this research and produces more suitable and precise output.
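The four evaluation criteria named above can be computed by comparing a set of extracted key phrases with a reference set. The sketch below uses the standard set-based definitions; the function name `evaluate`, the exact-match comparison after lowercasing, and the sample phrases are assumptions for illustration and may differ from the matching rules used in the experiments.

```python
def evaluate(extracted, reference):
    """Compare extracted key phrases with reference key phrases (exact match)."""
    ext = {p.lower().strip() for p in extracted}
    ref = {p.lower().strip() for p in reference}
    matched = len(ext & ref)  # phrases present in both sets

    precision = matched / len(ext) if ext else 0.0
    recall = matched / len(ref) if ref else 0.0
    # F-score: harmonic mean of precision and recall.
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    # Jaccard similarity coefficient: intersection size over union size.
    union = len(ext | ref)
    jaccard = matched / union if union else 0.0

    return {"precision": precision, "recall": recall,
            "f_score": f_score, "jaccard": jaccard}


# Hypothetical example data, not results from this research.
scores = evaluate(
    ["text mining", "key phrase", "stopword list"],
    ["text mining", "key phrase", "noun phrase", "term frequency"],
)
print(scores)
```

With two of three extracted phrases matching a four-phrase reference set, precision is 2/3, recall is 1/2, the F-score is 4/7, and the Jaccard coefficient is 2/5.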
Keywords: text mining, key phrase extraction, noun phrase, part-of-speech tagging, term frequency, co-occurrence matrix, regular expression, natural language processing