Atefeh Pakzad

Title

Representation of Sentences in Semantic Space Using Parameter Estimation Methods

Degree level

PhD

Field of study

Computer Engineering

Years of study

1393-1400 (Solar Hijri)

Defense date

1400/9/21 (Solar Hijri)

Supervisor

Dr. Analoui

Faculty

Computer Engineering

Abstract

Distributional semantic models represent the meaning of words as vectors. There are two approaches to obtaining semantic word vectors: count-based and prediction-based. Vectors produced by count-based methods have many dimensions, so dimensionality reduction methods are usually applied to shrink them. Prediction-based methods use deep learning to produce compact, low-dimensional word embeddings, which perform well in NLP applications. The components of a word embedding are real numbers, and its basis vectors have no lexical equivalent. In word vectors obtained by count-based methods, each dimension has a lexical equivalent, but dimensionality reduction turns these vectors into implicit vectors. In this research, motivated by the field of explainable artificial intelligence, we propose a hybrid approach for low-dimensional explicit word representation in which each basis vector of the semantic space corresponds to one basis word. The hybrid approach reduces the dimensionality of the word vectors in such a way that each dimension keeps a lexical equivalent and the performance of the word vectors on the word similarity task does not drop. Within the proposed hybrid approach, for counting co-occurrences of the target word and the context words, we propose a localization method that, instead of a fixed-length window, uses an exponential function of the distance between the target word and the context word. In addition to word frequency, we introduce two criteria, word similarity and the number of zero components, as features of the words in the corpus. We then extract a set of rules for obtaining the initial basis words from a decision tree built on these three features. The hybrid approach uses a word selection method to learn a vector space each of whose dimensions is a natural word. The word selection method starts from the most frequent words and selects the subset with the best performance. Using it, we obtain 1000 basis words.
In addition, using a binary weighting method based on the binary particle swarm optimization algorithm, we select the golden words of the corpus and add them, as golden context words, to the 1000 basis words chosen by the word selection method. The word vectors in this research are built from the ukWaC corpus. We evaluate the resulting low-dimensional explicit word vectors on the word similarity task, and we also evaluate their interpretability both qualitatively and quantitatively. In the experiments, the word similarity results are compared with those of a count-based baseline model that has 5000 most frequent context words and counts co-occurrences with a fixed window rather than the localization method. Compared with the baseline vectors, the resulting low-dimensional word vectors increase the Spearman correlation coefficient on the MEN, RG-65, and SimLex-999 test sets by 4.66%, 15.23%, and 3.27%, respectively. The interpretability of the word vectors also increases considerably, both qualitatively and quantitatively, relative to prediction-based models.
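
The localization idea in the abstract (weighting each co-occurrence by an exponential function of the distance between target and context word, rather than counting inside a fixed window) can be illustrated with a minimal sketch. The record does not give the exact function, so the form exp(-decay * (distance - 1)), the `decay` parameter, and the name `localized_cooccurrence` are assumptions made for illustration only.

```python
import math
from collections import defaultdict


def localized_cooccurrence(tokens, context_words, decay=1.0):
    """Accumulate distance-weighted co-occurrence counts for one sentence.

    Each (target, context) pair contributes exp(-decay * (distance - 1)),
    so adjacent words add 1.0 and the contribution shrinks smoothly with
    distance. This functional form and `decay` are assumed, not taken
    from the thesis.
    """
    context_words = set(context_words)
    counts = defaultdict(lambda: defaultdict(float))
    # Collect positions of context words once, so other tokens are skipped.
    context_positions = [(i, w) for i, w in enumerate(tokens) if w in context_words]
    for t_pos, target in enumerate(tokens):
        for c_pos, context in context_positions:
            if c_pos == t_pos:
                continue  # a token does not co-occur with itself
            distance = abs(t_pos - c_pos)
            counts[target][context] += math.exp(-decay * (distance - 1))
    return counts


tokens = "the cat sat on the mat".split()
counts = localized_cooccurrence(tokens, context_words={"cat", "mat"}, decay=0.5)
```

In this toy sentence "sat" is adjacent to "cat" and three positions away from "mat", so the first pair receives full weight and the second a reduced one. A fixed window would instead give every pair inside the window the same weight and ignore every pair outside it.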

Date of data entry

1400/12/02 (Solar Hijri)

Title in English

Representation of sentences in Semantic Space by Parameter Estimation Methods

Date of use

12/12/2022 12:00:00 AM

Student who entered the data

Atefeh Pakzad

Abstract in English

Distributional semantic models represent the meaning of words as vectors. There are two families of models for obtaining semantic word vectors: count-based and prediction-based. Word vectors derived from count-based models have many dimensions, so dimensionality reduction methods are usually applied to them. Prediction-based models use deep learning methods to produce compact, low-dimensional word embeddings, which perform well in NLP applications. The components of a word embedding are real numbers, and its basis vectors have no lexical equivalent. In word vectors obtained by count-based models, each dimension has a lexical equivalent, but dimensionality reduction methods transform these vectors into implicit vectors. In this study, motivated by the field of explainable artificial intelligence, we propose a hybrid approach for building low-dimensional explicit word vectors in which each basis vector of the semantic space corresponds to one basis word. The hybrid approach reduces the dimensions of the word vectors in such a way that each dimension has a lexical equivalent and the performance of the word vectors on the word similarity task does not diminish. Within the hybrid approach, we propose a localization method for counting the co-occurrences of target words and context words. Instead of a fixed-length window, the localization method uses an exponential function of the distance between the target word and the context word. In addition to word frequency, we introduce two criteria, word similarity and the number of zero components, as features of the target words. We then extract rules for obtaining the initial basis words from a decision tree built on these three features. The hybrid approach uses a word selection method to learn a vector space in which each dimension is a natural word.
The word selection method starts from the most frequent words and selects the subset that gives the best performance. We use it to obtain 1000 basis words. We also select golden words of the corpus using a binary weighting method based on the binary particle swarm optimization algorithm and add them, as golden context words, to the 1000 basis words chosen by the word selection method. We use the ukWaC corpus to construct the word vectors. We evaluate the low-dimensional explicit word vectors on the word similarity task, and we evaluate their interpretability both qualitatively and quantitatively. In the experiments, the word similarity results are compared with those of a count-based baseline model that has the 5,000 most frequent context words and uses a fixed window instead of the localization method. Compared with the baseline model, the resulting low-dimensional explicit word vectors increase the Spearman correlation coefficient on the MEN, RG-65, and SimLex-999 test sets by 4.66%, 14.73%, and 1.08%, respectively. The interpretability of the resulting word vectors also increases, qualitatively and quantitatively, relative to the prediction-based models.
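
The word similarity evaluation mentioned above is commonly carried out by scoring each word pair with the cosine similarity of its vectors and then computing the Spearman correlation between those scores and the human ratings. The sketch below shows that procedure with the standard library only; the function names, the toy vectors, and the toy ratings are all hypothetical and are not data from the thesis or from MEN, RG-65, or SimLex-999.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the two rank lists."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


def evaluate(vectors, gold_pairs):
    """Correlate model similarities with human ratings; skip unknown words."""
    model, human = [], []
    for w1, w2, rating in gold_pairs:
        if w1 in vectors and w2 in vectors:
            model.append(cosine(vectors[w1], vectors[w2]))
            human.append(rating)
    return spearman(model, human)


# Toy explicit vectors: each dimension is named by a basis word (made-up values).
vectors = {
    "cat": {"animal": 3.0, "pet": 2.0},
    "dog": {"animal": 3.0, "pet": 2.5},
    "car": {"vehicle": 4.0, "road": 1.0},
    "bus": {"vehicle": 3.0, "road": 2.0, "animal": 0.1},
}
gold_pairs = [("cat", "dog", 9.0), ("car", "bus", 8.0), ("cat", "car", 1.0)]
rho = evaluate(vectors, gold_pairs)
```

Because every dimension in an explicit vector is labeled with a basis word, the shared keys that drive a high cosine score (here "animal" and "pet" for "cat" and "dog") can be read off directly, which is the kind of interpretability the abstract refers to.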

Persian keywords

Natural language processing, explainable artificial intelligence, explicit semantic word vectors, interpretability, rule-based selection method, word selection method

English keywords

Natural language processing, explainable AI, explicit semantic word vectors, interpretability, rule-based selection method, word selection method

Author

Atefeh Pakzad

Supervisor

Dr. Analoui

Link to this record

https://dl.iust.ac.ir/dl/search/default.aspx?Term=26115&Field=0&DTC=6