Mohammad Nazari

Title

Document Classification using Context-aware Topic Model

Degree level

Master's

Field of study

Computer Engineering - Software

Academic year

98-401

Defense date

1401/06/20

Supervisor

Hossein Rahmani

Faculty

Computer

Abstract

The volume of data available in the world is growing daily. Searching, filtering, and finding the content that interests users in this vast space is difficult and challenging. Classification can be very helpful here by narrowing the search space and grouping topics. Today, especially with recent advances in natural language processing, many researchers are interested in developing applications that use text classification methods. Various methods have been proposed for document classification so far, including traditional methods and methods based on neural networks. In traditional methods, because document representation vectors are high-dimensional and sparse, the computational cost of classifiers is high and their accuracy is low. In addition, traditional methods do not take the semantic relationship between words into account. In neural-network-based methods, known as word embeddings, each word is represented with a fixed number of dimensions. With methods based on word and sentence embeddings, when documents are long the vectors move close to one another, which makes it difficult to separate documents using similarity measures. Moreover, these methods look at words locally and do not consider the global relationship between words. Therefore, in this research we introduce a document classification method based on the LDA topic model that takes the context of words into account using Word2vec word embeddings. This method combines LDA and Word2vec in order to capture both the local and the global features of words in the text. We then model the data as a graph and classify it using a graph autoencoder. In this research we used a dataset of movie plot synopses to classify movies by genre. The classification results and the analyses carried out on the constructed graph indicate that the proposed model outperforms previous methods. Overall, the classification results show a 7 percent increase in accuracy over previous work. We also used the proposed model in movie recommender systems to resolve their cold-start problem.

Record entry date

1401/07/18

Title in English

Document Classification using Context-aware Topic Model

Release date

9/11/2023 12:00:00 AM

Student who entered the record

Mohammad Nazari

Abstract in English

Nowadays, the volume of data is increasing. Searching, filtering, and finding the content of interest to users in this huge space is difficult and challenging. Classification can be very useful here by narrowing the search space and grouping topics. Especially with recent advances in natural language processing, many researchers are now interested in developing programs that use text classification methods. Various methods have been introduced so far for document classification, including traditional methods and methods based on neural networks. In traditional methods, the document representation vectors are high-dimensional and sparse, so the computational cost is high and the accuracy is low. In addition, traditional methods ignore the semantic relationship between words. In methods based on neural networks, which are known as word embeddings, each word is represented in a fixed number of dimensions. With methods based on word and sentence embeddings, when documents are long the vectors lie close to each other, which makes it difficult to distinguish documents using similarity measures. Moreover, these methods look at words locally and ignore the global connection between words. Therefore, in this research, we introduce a method for document classification that combines LDA and Word2vec in order to consider both local and global features of words in the text. We then model the data as a graph and classify it using a graph autoencoder. In this research, we used the plot synopses of movies to classify them according to their genres. The classification results and the evaluation of the graph show the superiority of the introduced model over previous methods. Overall, the classification results show a 7% increase in F-score compared to previous works. We also resolved the cold-start problem in movie recommender systems by using the introduced model.
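The abstract does not say how the LDA and Word2vec outputs are combined or how the graph is built, so the following is only a minimal sketch under stated assumptions: each document's topic distribution (global view) is concatenated with the mean of its word vectors (local view), and documents are linked to their most similar neighbours by cosine similarity. The function names, the concatenation step, the k-nearest-neighbour rule, and the toy numbers are all hypothetical and are not taken from the thesis. The graph autoencoder step is not shown.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (returns a copy if the norm is zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def document_features(topic_dist, word_vectors):
    """ASSUMPTION: combine views by concatenating the normalized topic
    distribution (global) with the normalized mean word vector (local)."""
    return l2_normalize(topic_dist) + l2_normalize(mean_vector(word_vectors))

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def similarity_graph(features, k=1):
    """ASSUMPTION: link each document to its k most similar documents.
    Returns a set of undirected edges (i, j) with i < j."""
    edges = set()
    for i, fi in enumerate(features):
        scored = sorted(
            ((cosine(fi, fj), j) for j, fj in enumerate(features) if j != i),
            reverse=True,
        )
        for _, j in scored[:k]:
            edges.add((min(i, j), max(i, j)))
    return edges

# Toy, hand-made inputs standing in for real LDA and Word2vec outputs.
topic_dists = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
word_vecs = [
    [[1.0, 0.0], [0.8, 0.2]],
    [[0.9, 0.1]],
    [[0.0, 1.0], [0.2, 0.8]],
    [[0.1, 0.9]],
]

features = [document_features(t, w) for t, w in zip(topic_dists, word_vecs)]
edges = similarity_graph(features, k=1)
print(sorted(edges))
```

In a real pipeline the toy inputs would be replaced by the outputs of trained topic and embedding models, and the resulting graph would be passed to a graph-based classifier.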

Persian keywords

Text mining, Document classification, Similarity graph, Text feature extraction, Context-aware topic model

English keywords

Text mining, Text classification, Text feature extraction, Context-aware topic model

Author

Mohammad Nazari

Supervisor

Hossein Rahmani

Link to this record

https://dl.iust.ac.ir/dl/search/default.aspx?Term=27118&Field=0&DTC=6