Behrooz Janfada

Title

An Algorithm for Text Simplification in Persian and Its Application to Improving Relation Extraction Algorithms

Degree level

Master's

Field of study

Computer Engineering - Software

Academic year

1400

Defense date

1400/01/31

Supervisor

Dr. Behrooz Minaei-Bidgoli

Faculty

Computer Engineering

Abstract (translated from Persian)

Text simplification is a developing area of natural language processing that has attracted growing attention as hardware has improved, methods have advanced, and applications have diversified. Text simplification is the process of modifying natural language sentences so that their complexity decreases and their readability and comprehensibility increase. Automating this process is difficult, and the algorithms proposed in this area aim to achieve the greatest reduction in complexity and the greatest gain in readability and comprehensibility with the fewest errors. Meanwhile, the input texts of text mining processes are generally sets of complex natural language sentences, which makes it hard for text mining algorithms to identify the grammatical and lexical features of these sentences and raises the error rate of the results. One way to reduce this error is to use text simplification algorithms as a pre-processing task for text mining algorithms, which lowers the complexity of the input and, as a result, reduces the error of the text mining algorithm and increases its recall. No text simplification algorithm has been presented for Persian. In addition, the existing relation extraction and knowledge extraction algorithms for Persian need improvement. In this research we present the first text simplification algorithm for Persian. Because state-of-the-art algorithms in other languages are trained on training corpora available in those languages, no such corpus exists for Persian, and producing one is costly and time-consuming, the proposed algorithm is unsupervised and requires no training dataset. The algorithm is a rule-based system whose simplification rules are designed with a particular kind of regular expression over text features (for example, grammatical features) and with the help of expert users. For evaluation, we applied the algorithm as a pre-processing system for existing relation extraction algorithms and compared the results with those of the relation extraction algorithm without this pre-processing, showing that using the text simplification algorithm as a pre-processing task improves the results of the relation extraction algorithm.

Data entry date

1400/12/11

Title in English

Text Simplification, Relation Extraction, Knowledge Extraction, Natural Language Processing, Persian Language

Release date

4/20/2022 12:00:00 AM

Student who entered the data

Behrooz Janfada

Abstract in English

Text simplification is an evolving field of natural language processing that has received increasing attention as hardware has improved, methods have matured, and applications have diversified. It is the process of modifying natural language sentences to reduce their complexity and increase their readability and comprehensibility. Automating this process is difficult, and the algorithms proposed in this area aim for a lower error rate, a greater reduction in complexity, and higher readability and comprehensibility. The input texts of text mining methods, meanwhile, are generally sets of complex natural language sentences, which makes it difficult for text mining algorithms to recognize the grammatical and lexical properties of these sentences and raises the error rate of the results. One way to reduce this error is to use text simplification as a pre-processing task for text mining algorithms: it reduces the complexity of the input, lowers the error rate of the text mining algorithm, and increases its recall. No text simplification algorithm has previously been presented for Persian, and the current relation extraction and knowledge extraction algorithms for Persian need improvement. In this thesis, we present the first text simplification algorithm for the Persian language. State-of-the-art algorithms in other languages are trained on labeled corpora; no such corpus is available for Persian, and creating one is costly and time-consuming. The algorithm proposed in this research is therefore an unsupervised method that does not need such corpora. It is a rule-based system in which the simplification rules are designed using a specific type of regular expression over text features (for example, grammatical features) and with the help of expert users. For evaluation, we used the algorithm as a pre-processing step for an existing relation extraction method and compared the results with those of the same method without this pre-processing. The comparison shows that using the text simplification algorithm as a pre-processing task improves the results of that relation extraction method.
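
The idea of writing simplification rules as regular expressions over grammatical features, rather than over the raw characters, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the coarse tag set, the one-letter tag codes, the single clause-splitting rule, and the English example sentence are hypothetical and are not taken from the thesis, whose actual rules and expression syntax are not given in this record.

```python
import re

# Hypothetical coarse POS tags mapped to one-letter codes, so that an ordinary
# regular expression can be matched against the tag sequence of a sentence.
TAG_CODES = {"NOUN": "N", "VERB": "V", "CCONJ": "C", "DET": "D", "PUNCT": "P"}

# Illustrative rule (an assumption, not a rule from the thesis): two spans that
# each contain a verb, joined by one coordinating conjunction, are split apart.
SPLIT_RULE = re.compile(r"([^C]*V[^C]*)C([^C]*V[^C]*)")


def encode(tagged):
    """Turn [(word, tag), ...] into a string with one code letter per token."""
    return "".join(TAG_CODES.get(tag, "X") for _, tag in tagged)


def simplify(tagged):
    """Split a coordinated sentence into two simpler ones if the rule matches."""
    words = [word for word, _ in tagged]
    # Trailing punctuation is ignored; stripping only from the right keeps
    # string offsets aligned with token positions.
    code = encode(tagged).rstrip("P")
    match = SPLIT_RULE.fullmatch(code)
    if match is None:
        return [" ".join(w for w, t in tagged if t != "PUNCT")]
    first = words[match.start(1):match.end(1)]
    second = words[match.start(2):match.end(2)]
    return [" ".join(first), " ".join(second)]


sentence = [
    ("The", "DET"), ("parser", "NOUN"), ("reads", "VERB"),
    ("the", "DET"), ("text", "NOUN"), ("and", "CCONJ"),
    ("the", "DET"), ("extractor", "NOUN"), ("finds", "VERB"),
    ("relations", "NOUN"), (".", "PUNCT"),
]
print(simplify(sentence))
```

Because the rule requires a verb on both sides of the conjunction, a coordinated noun phrase such as "cats and dogs sleep" is left unchanged. A real system would need many such rules and a part-of-speech tagger for the target language; the sketch only shows the matching mechanism.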

Persian keywords

Text Simplification, Rule-Based, Regular Expressions, Relation Extraction, Natural Language Processing, Persian Language

English keywords

Text Simplification, Rule-Based, Regular Expressions, Relation Extraction, Natural Language Processing, Persian Language

Author

Behrooz Janfada

Supervisor

Dr. Behrooz Minaei-Bidgoli

Link to this record

https://dl.iust.ac.ir/dl/search/default.aspx?Term=26164&Field=0&DTC=6