Abstract (in English)
Today, automatic speech recognition (ASR) systems based on deep learning have advanced significantly beyond traditional approaches. However, deep neural networks require large amounts of data to perform well, and most languages lack adequate training datasets; acquiring a suitable dataset is often difficult or impossible. Data augmentation is one of the methods proposed to address this problem.
In this study, we evaluate how data augmentation methods affect the training and performance of an end-to-end Wav2Vec2 model (without a language model). We first categorized the augmentation methods into two groups: 1) augmentation applied to the raw speech data, and 2) augmentation applied in the feature space. We then evaluated the model's performance in each category using augmentation methods in the time domain, the frequency domain, and both. Note that only 30% of the TIMIT training set was used to train the model (approximately 70 minutes of labeled data).
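As a rough illustration of these two categories, the sketch below uses torchaudio (an assumed tool choice; the thesis does not name its implementation, and the parameter values and input file here are hypothetical, not the study's settings) to apply time-domain augmentation to the raw waveform and SpecAugment-style frequency masking in the feature space.

import torch
import torchaudio
import torchaudio.transforms as T

# Hypothetical input utterance; TIMIT audio is 16 kHz.
waveform, sample_rate = torchaudio.load("sample.wav")

# 1) Raw-data (time-domain) augmentation: speed perturbation plus additive noise.
speed = T.SpeedPerturbation(sample_rate, factors=[0.9, 1.0, 1.1])
augmented_wave, _ = speed(waveform)
noisy_wave = augmented_wave + 0.005 * torch.randn_like(augmented_wave)

# 2) Feature-space (frequency-domain) augmentation: SpecAugment-style masking
#    applied to a spectrogram rather than to the raw signal.
spec = T.MelSpectrogram(sample_rate)(waveform)
spec = T.FrequencyMasking(freq_mask_param=15)(spec)
spec = T.TimeMasking(time_mask_param=35)(spec)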
In our experiments, we found that regardless of the method used, data augmentation significantly improves the ASR model's character-level recognition performance. Interestingly, the best character-level performance was achieved when the data were augmented in the frequency domain. However, these methods were less successful at improving word-level recognition. In contrast, time-domain augmentation enabled the model to learn linguistic features implicitly and, as a result, improved its word-level recognition performance.
According to the experiments, the model performed best when the training data were augmented in both the time and frequency domains: the WER decreased from 25.9% to 23.7%, a 2.2% absolute improvement in accuracy over the model trained without augmentation.
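For reference, WER counts word-level substitutions (S), deletions (D), and insertions (I) against the number of reference words (N): WER = (S + D + I) / N. A minimal check using the jiwer package (an assumed tool, not necessarily the thesis's evaluation code) on a TIMIT prompt:

import jiwer

# One substitution ("here" for "year") over 11 reference words: WER = 1/11 ≈ 0.091.
reference = "she had your dark suit in greasy wash water all year"   # TIMIT SA1 prompt
hypothesis = "she had your dark suit in greasy wash water all here"  # hypothetical model output
print(jiwer.wer(reference, hypothesis))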
Finally, using the selected augmentation methods, we trained the baseline model on the full TIMIT training set (more than tripling the amount of training data). The results at this stage confirmed our earlier findings: the best results were obtained when the raw speech data were augmented in both the time and frequency domains in the feature space. The WER decreased from 19.3% to 18.7%, outperforming the QCNN model, which ranked 19th on the TIMIT leaderboard with a WER of 19.64%.