Abstract (in English)
Today, automatic speech recognition (ASR) systems based on deep learning have advanced significantly beyond traditional approaches. However, deep neural networks require large amounts of data to perform well, and most languages lack adequate training datasets; acquiring a suitable dataset is often difficult or impossible. Data augmentation is one of the methods proposed to address this problem.
In this study, we evaluate how data augmentation methods affect the training and performance of an end-to-end Wav2Vec2 model (without a language model). We first categorized the augmentation methods into two groups: 1) augmentation applied to the raw speech data, and 2) augmentation applied in the feature space. We then evaluated the model's performance in each category using augmentation methods in the time domain, the frequency domain, and both. Note that only 30% of the TIMIT training set was used to train the model (approximately 70 minutes of labeled data).
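As a rough illustration of these two categories, the sketch below uses torchaudio (an assumed tool choice; the thesis does not name its implementation, and the parameter values and input file here are hypothetical, not the study's settings) to apply time-domain augmentation to the raw waveform and SpecAugment-style frequency masking in the feature space.

import torch
import torchaudio
import torchaudio.transforms as T

# Hypothetical input utterance; TIMIT audio is 16 kHz.
waveform, sample_rate = torchaudio.load("sample.wav")

# 1) Raw-data (time-domain) augmentation: speed perturbation plus additive noise.
speed = T.SpeedPerturbation(sample_rate, factors=[0.9, 1.0, 1.1])
augmented_wave, _ = speed(waveform)
noisy_wave = augmented_wave + 0.005 * torch.randn_like(augmented_wave)

# 2) Feature-space (frequency-domain) augmentation: SpecAugment-style masking
#    applied to a spectrogram rather than to the raw signal.
spec = T.MelSpectrogram(sample_rate)(waveform)
spec = T.FrequencyMasking(freq_mask_param=15)(spec)
spec = T.TimeMasking(time_mask_param=35)(spec)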
In our experiments, we found that regardless of the method used, data augmentation significantly improves the ASR model's character-level recognition performance. Interestingly, the best character-level performance was achieved when the data were augmented in the frequency domain. However, these methods were less successful at improving word-level recognition. In contrast, time-domain augmentation enabled the model to learn linguistic features implicitly and, as a result, improved its word-level recognition performance.
According to the experiments, the model performed best when the training data were augmented in both the time and frequency domains: the WER decreased from 25.9% to 23.7%, a 2.2% absolute improvement in accuracy over the model trained without augmentation.
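For reference, WER counts word-level substitutions (S), deletions (D), and insertions (I) against the number of reference words (N): WER = (S + D + I) / N. A minimal check using the jiwer package (an assumed tool, not necessarily the thesis's evaluation code) on a TIMIT prompt:

import jiwer

# One substitution ("here" for "year") over 11 reference words: WER = 1/11 ≈ 0.091.
reference = "she had your dark suit in greasy wash water all year"   # TIMIT SA1 prompt
hypothesis = "she had your dark suit in greasy wash water all here"  # hypothetical model output
print(jiwer.wer(reference, hypothesis))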
Finally, using the selected augmentation methods, we trained the baseline model on the full TIMIT training set (more than tripling the amount of training data). The results at this stage confirmed our earlier findings: the best results were obtained when the raw speech data were augmented in both the time and frequency domains in the feature space. The WER decreased from 19.3% to 18.7%, outperforming the QCNN model, which ranked 19th on the TIMIT leaderboard with a WER of 19.64%.