Mahmood Kalantari Khalil Abad

Title

Prompt Engineering for Avatar Generation in the Metaverse Using Language Models

Degree

Master's

Field of study

Artificial Intelligence and Robotics

Year of study

1401

Defense date

1404/9/29

Supervisor

Nasser Mozayani

Advisor

None

Faculty

Computer Engineering

Abstract

With the growing use of language models, there is a need to develop models that keep pace with recent advances. Because of its complex morpho-syntactic structure, variation in written form, and the scarcity of large-scale clean data, Persian still faces challenges in natural language processing. To reduce these limitations, this thesis develops and pre-trains an encoder model with a recent architecture, based on ModernBERT, on a large corpus of Persian text, and presents it as infrastructure for designing textual inputs in metaverse avatar generation systems. First, a large dataset of more than 100 billion tokens of Persian text was collected from news, books, articles, blogs, and social networks. The data were then preprocessed for language model training through a multi-stage cleaning process that included normalization, removal of structural noise such as symbols and emojis, and removal of duplicate sentences. A new tokenizer based on the byte-pair encoding algorithm, with a vocabulary of 50 thousand entries, was built and trained to segment words efficiently. The encoder model was trained in three separate phases, with the input sequence length increased from 512 to 1024 and then to 8192 tokens, so that long texts can be processed. Evaluation on natural language processing tasks, including sentiment analysis, text classification, named entity recognition, natural language inference, part-of-speech tagging, and question answering, showed that the developed model is competitive on most tasks and performs up to 6 percent better on tasks such as text classification and sentiment analysis. In addition, the increased effective context length significantly improved the model's ability to process long documents. This thesis is the first attempt to train a model with an up-to-date architecture for Persian, and it can serve as the main foundation for developing tools based on encoder models, such as semantic search engines, text classification and analysis models, avatar generation systems, and information retrieval systems.
Finally, as directions for future research, the thesis proposes developing mixture-of-experts variants of the architecture, expanding domain-specific data, and strengthening evaluation with standard domain-specific Persian datasets.
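
The multi-stage cleaning named in the abstract (normalization, removal of symbols and emojis, removal of duplicate sentences) could look roughly like the sketch below. This is only an illustration under stated assumptions, not the thesis's actual pipeline: the character map, the emoji ranges, the sentence splitter, and the names `normalize` and `clean_corpus` are all hypothetical choices made here.

```python
import re
import unicodedata

# Assumption: map Arabic yeh/kaf to their Persian counterparts, a common
# normalization step for Persian text (the thesis does not list its rules).
CHAR_MAP = str.maketrans({"\u064A": "\u06CC", "\u0643": "\u06A9"})

# Assumption: a rough emoji/pictograph range, not an exhaustive list.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

# Naive sentence boundary: whitespace that follows ., !, ? or the Arabic question mark.
SENT_SPLIT_RE = re.compile(r"(?<=[.!?\u061F])\s+")


def normalize(text: str) -> str:
    """Unify characters, drop emojis, and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text).translate(CHAR_MAP)
    text = EMOJI_RE.sub("", text)
    return re.sub(r"\s+", " ", text).strip()


def clean_corpus(docs):
    """Yield cleaned documents, keeping each distinct sentence only once."""
    seen = set()
    for doc in docs:
        kept = []
        for sent in SENT_SPLIT_RE.split(normalize(doc)):
            if sent and sent not in seen:
                seen.add(sent)
                kept.append(sent)
        if kept:  # skip documents made up entirely of repeated sentences
            yield " ".join(kept)
```

Exact-match deduplication with an in-memory set is the simplest possible choice; a corpus of the size described would more plausibly need hashing or approximate matching, which this sketch does not attempt.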

Date of data entry

1405/02/26

Title in English

Prompt Engineering for Avatar Generation in the Metaverse Using Language Models

Date of use

1/1/1900 12:00:00 AM

Data entered by (student)

Mahmood Kalantari Khalil Abad

Abstract (English)

With the increasing use of language models, the need to develop models aligned with recent advancements has emerged. Due to its complex morphological-syntactic structure, variability in writing conventions, and the lack of large-scale clean datasets, the Persian language still faces challenges in natural language processing. This study aims to mitigate these limitations by developing and pre-training an encoder model with a modern architecture, based on the ModernBERT framework, on a large corpus of Persian textual data, and introducing it as an infrastructure for designing textual inputs in metaverse avatar generation systems. A large dataset containing more than 100 billion tokens from Persian texts across news, books, articles, blogs, and social media was first collected. The data then underwent a multi-stage cleaning process, including normalization, removal of structural noise such as symbols and emojis, and elimination of duplicate sentences, for language model training. A new tokenizer based on the Byte Pair Encoding algorithm with a 50k vocabulary size was developed and trained to optimally segment words. The encoder model was trained in three separate phases with progressively increasing input sequence lengths from 512 to 1024 and eventually to 8192 tokens, enabling the processing of long documents. Evaluation results on natural language processing tasks (sentiment analysis, text classification, named entity recognition, natural language inference, part-of-speech tagging, and question answering) showed that the developed model achieves competitive performance in most tasks and provides up to a 6% improvement in tasks such as text classification and sentiment analysis. Furthermore, the increased effective context length significantly improved the model's ability to process long documents.
This study represents the first attempt to train a Persian model with a state-of-the-art architecture and can serve as a foundational platform for developing encoder-based applications such as semantic search engines, text classification and analysis models, avatar generation systems, and information retrieval tools. Finally, for future research directions, the development of mixture-of-experts architectural variants, expansion of domain-specific datasets, and enhancement of evaluation using standardized Persian benchmarks are proposed.
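
To make the Byte Pair Encoding step mentioned in the abstract concrete, the following is a minimal character-level sketch of how BPE learns its merge rules: it repeatedly counts adjacent symbol pairs and fuses the most frequent one. It is a toy under stated assumptions, not the thesis's tokenizer: the function name `learn_bpe`, the tie-breaking rule, and the word-list input are hypothetical, and a real 50k vocabulary would be trained with a dedicated library on the full corpus.

```python
from collections import Counter


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` BPE merge rules from a list of words."""
    # Each word starts as a tuple of single characters, weighted by its count.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the weighted vocabulary.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair wins; ties are broken by the pair itself (an assumption).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the chosen pair.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges
```

For example, `learn_bpe(["aab", "aab", "ab"], 2)` first merges `("a", "b")`, the most frequent pair, and then `("a", "ab")`.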

Keywords (Persian)

Encoder models, Persian text corpus, Natural language processing

Keywords (English)

Encoder-only, Persian Corpus, Natural Language Processing

Author

Mahmood Kalantari Khalil Abad

Supervisor

Nasser Mozayani

Link to this record

https://dl.iust.ac.ir/dl/search/default.aspx?Term=34810&Field=0&DTC=6