Ali Lashini

Title

Design of audio-visual speech recognition system using knowledge distillation

Degree level

Master's

Field of study

Computer Engineering - Artificial Intelligence and Robotics

Academic year

1399

Defense date

1402/6/30

Supervisor

Dr. Naser Mozayani

Faculty

Computer Engineering

Abstract

Speech recognition is a key area of artificial intelligence that has attracted a great deal of attention from the outset. Advances in the field have helped establish it in everyday life: from voice assistants in phones and cars to speech-to-text systems, these technologies have become an essential part of modern life. Nevertheless, problems such as accurate recognition in noisy environments remain unsolved. To overcome these limitations, audio-visual speech recognition systems have been proposed, which combine lip and face features with audio to improve recognition. Knowledge distillation is a method of training artificial intelligence models that transfers knowledge from larger models to smaller ones. Its goal is for the smaller model to learn capabilities as close as possible to those of the larger model. The larger model is called the teacher and the smaller model the student. Although knowledge distillation was first proposed for training models with less computational capacity, its potential has led to its use in some cases as an alternative way to train neural networks better. In this research, we aim to use knowledge distillation to train an audio-visual speech recognition network that is more robust to noisy audio and visual data. To this end, we use a method called knowledge distillation based on data distortion. In this method, the data is given to the network once unchanged and once with modifications applied in the form of noise or augmentation methods; the difference between the network's outputs for the two inputs is then computed and returned to the network as a loss. The proposed network uses the pretrained AV-HuBERT model to extract audio-visual features and two LSTM layers to process them; the processed features are passed to two convolutional layers for fusion and output generation. Using a single network reduces training time and removes the need for a larger network. On the evaluation data, the proposed method has a word error rate of 28.35 percent and a character error rate of 10.38 percent, which are improvements of about two percent and one percent, respectively, over the baseline model. When environmental noise is applied, however, the proposed model improves markedly on the baseline: in the best case, by about 5 percent in word error rate and about 11 percent in character error rate.
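The data-distortion distillation step described in the abstract can be illustrated with a minimal sketch. This is not the thesis implementation: the abstract says only that the difference between the outputs for clean and distorted inputs is used as a loss, so the KL divergence, the additive Gaussian noise, and the one-layer `toy_network` (a stand-in for the AV-HuBERT, LSTM, and convolutional stack) are all assumptions chosen for illustration.

```python
import math
import random


def softmax(logits):
    """Convert raw scores to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions.

    Assumption: the abstract does not name the discrepancy measure;
    KL divergence is a common choice in knowledge distillation.
    """
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def toy_network(features, weights):
    """Hypothetical stand-in for the recognizer: one linear layer -> logits."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


def add_noise(features, scale, rng):
    """Hypothetical distortion: additive Gaussian noise on the input features."""
    return [x + rng.gauss(0.0, scale) for x in features]


def distortion_distillation_loss(features, weights, scale, rng):
    """Discrepancy between the outputs of ONE network on clean and distorted input.

    The clean-input output acts as the teacher signal and the
    distorted-input output as the student, so no separate teacher is needed.
    """
    clean_probs = softmax(toy_network(features, weights))
    noisy_probs = softmax(toy_network(add_noise(features, scale, rng), weights))
    return kl_divergence(clean_probs, noisy_probs)


rng = random.Random(0)
weights = [[0.5, -0.2, 0.1], [-0.3, 0.8, 0.4]]  # 2 output classes, 3 input features
features = [1.0, 2.0, -1.0]

# With no distortion the two outputs coincide, so the loss is zero.
loss_no_distortion = distortion_distillation_loss(features, weights, 0.0, rng)
# With distortion the outputs differ and the loss becomes positive.
loss_with_distortion = distortion_distillation_loss(features, weights, 0.5, rng)
```

In a real training loop this term would typically be added to the ordinary recognition loss and minimized by gradient descent, which pushes the network to give the same output for clean and distorted versions of the same utterance.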

Data entry date

1402/09/12

Title in English

Design of audio-visual speech recognition system using knowledge distillation

Release date

1/1/1900 12:00:00 AM

Student who entered the data

Ali Lashini

Abstract in English

Speech recognition, a key area of artificial intelligence, has long attracted researchers' attention. Advances in the field have quickly brought voice-based systems into daily life: voice assistants on mobile devices and in cars, along with speech-to-text applications, are now commonplace. Nevertheless, challenges persist, particularly in recognizing speech accurately in noisy environments. To overcome these limitations, researchers have turned to audio-visual speech recognition systems, which combine lip movements, facial expressions, and the speaker's voice to improve recognition. Knowledge distillation is a technique for training a model using the knowledge of another model. It was originally developed to transfer knowledge from a larger model to a smaller one, so that the smaller model achieves recognition performance close to that of its larger counterpart. The larger model is called the teacher and the smaller model the student. Although knowledge distillation was first proposed for training models with limited computational resources, its potential has led to its adoption as an alternative way to train neural networks more effectively. The objective of this research is to use knowledge distillation to train an audio-visual speech recognition network that is more robust to noisy audio-visual data. To achieve this, we use a technique known as knowledge distillation based on data distortion. The network is presented with both the unchanged data and a version altered by added noise or augmentation methods. The discrepancy between the network's outputs for the two versions is then computed and fed back into the network as a loss. This method uses a single network and does not require prior training of a larger network, which reduces training time. On the evaluation data, the proposed method achieves a word error rate of 28.35 percent and a character error rate of 10.38 percent, improving on the baseline model by approximately two percent at the word level and one percent at the character level. The main strength of the proposed model, and the primary focus of this research, is its handling of environmental noise, where it improves substantially on the baseline: in the best case, the word error rate drops by about 5 percent and the character error rate by about 11 percent. These results indicate improved performance in real-world, noisy conditions.
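For readers unfamiliar with the two metrics reported above, the following sketch shows how word error rate and character error rate are conventionally computed: the edit (Levenshtein) distance between the reference and the recognized text, divided by the reference length. The function names and the example sentences are illustrative assumptions, and the thesis may have used a dedicated scoring tool with its own text normalization.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion from the reference
                            curr[j - 1] + 1,      # insertion into the reference
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def character_error_rate(reference, hypothesis):
    """Character-level edit distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# Illustrative example (not data from the thesis).
reference = "the cat sat on the mat"
hypothesis = "the cat sit on mat"
wer = word_error_rate(reference, hypothesis)       # 2 word edits over 6 words
cer = character_error_rate(reference, hypothesis)  # 5 character edits over 22 characters
```

Because the character-level comparison gives partial credit for nearly correct words, the character error rate is usually lower than the word error rate, which matches the pattern in the reported figures.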

Persian keywords

Audio-visual speech recognition, Knowledge distillation, Speech recognition, Deep learning

English keywords

Audio-visual speech recognition, Knowledge distillation, Speech recognition, Deep learning

Author

Ali Lashini

Supervisor

Naser Mozayani

Link to this document

https://dl.iust.ac.ir/dl/search/default.aspx?Term=30155&Field=0&DTC=6