Abstract (in English)
Speech recognition, a crucial domain in artificial intelligence, has long captivated researchers' attention. Remarkable advancements in this field have rapidly brought voice-based systems into our daily lives. From voice assistants on mobile devices and in automobiles to speech-to-text applications, these technologies have become indispensable in today's society. Nevertheless, challenges persist within these systems, particularly in accurately recognizing speech in noisy environments. To surmount these limitations, researchers have turned to audio-visual speech recognition systems. By leveraging lip movements, facial expressions, and the speaker's voice, these systems aim to enhance speech recognition capabilities.
Knowledge distillation is an artificial intelligence technique used to train models by leveraging the knowledge of other models. Initially developed to transfer knowledge from a larger model to a smaller one, its purpose is to enable the smaller model to achieve recognition performance similar to that of its larger counterpart. In this context, the larger model is referred to as the teacher, while the smaller model is known as the student. Although knowledge distillation was initially proposed to make training efficient for models intended to run with limited computational resources, its potential has led to its adoption as an alternative method for improved and optimized training of neural networks.
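As a brief illustration of the teacher-student setting described above, a minimal PyTorch-style sketch of the classic distillation loss could look like the following; the temperature, weighting factor, and function name are illustrative assumptions, not the exact configuration used in this work.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets,
                      temperature=2.0, alpha=0.5):
    """Classic teacher-student knowledge distillation loss (illustrative).

    Combines the usual cross-entropy on the ground-truth labels with a
    KL-divergence term that pulls the student's softened predictions
    toward the teacher's. Temperature and alpha are assumed values.
    """
    # Hard-label loss on the ground-truth targets.
    ce = F.cross_entropy(student_logits, targets)

    # Soft-label loss: KL divergence between temperature-softened distributions.
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)

    return alpha * ce + (1.0 - alpha) * kd
```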
In this research, our objective is to employ knowledge distillation to train an audio-visual speech recognition network with enhanced robustness to noisy audio-visual data. To achieve this, we have employed a technique known as knowledge distillation based on data distortion. This approach presents the network with both the unchanged data and a version of the data altered through the introduction of noise or data augmentation. The discrepancy between the network's outputs for the two versions is then used as a loss term and backpropagated through the network. Notably, this method uses a single network and does not require prior training of a larger network. As a result, training time is reduced, and the need for a larger teacher network is eliminated; a simplified sketch of this idea follows.
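The following is a minimal sketch of the data-distortion-based self-distillation step described above, written as a generic classification-style training step in PyTorch; the Gaussian-noise distortion, loss weighting, and the use of a stop-gradient on the clean output are assumptions for illustration rather than the exact formulation of the proposed system.

```python
import torch
import torch.nn.functional as F

def self_distillation_step(model, clean_input, targets, noise_std=0.05, beta=1.0):
    """One training step of data-distortion-based self-distillation (illustrative).

    The same network processes the clean input and a distorted copy; the
    discrepancy between the two output distributions is added to the task
    loss as a consistency (distillation) term.
    """
    # Distort the input, e.g. by adding Gaussian noise (a stand-in for the
    # noise/augmentation applied to the audio-visual data).
    distorted_input = clean_input + noise_std * torch.randn_like(clean_input)

    clean_logits = model(clean_input)
    distorted_logits = model(distorted_input)

    # Task loss on the clean view (cross-entropy against the targets).
    task_loss = F.cross_entropy(clean_logits, targets)

    # Consistency loss: KL divergence from the clean output (treated as the
    # "teacher" signal) to the distorted output. detach() keeps gradients
    # from flowing through the teacher side.
    consistency = F.kl_div(
        F.log_softmax(distorted_logits, dim=-1),
        F.softmax(clean_logits.detach(), dim=-1),
        reduction="batchmean",
    )

    return task_loss + beta * consistency
```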
The proposed method has demonstrated promising results in the evaluation phase, achieving a word error rate of 28.35 percent and a character error rate of 10.38 percent on the evaluation data. Notably, it outperforms the baseline model by approximately two percent at the word level and one percent at the character level. However, the true strength of the proposed model lies in its ability to handle environmental noise, which is the primary focus of this research. In this regard, the proposed model exhibits substantial improvements over the baseline. In the best case, it achieves a 5 percent reduction in word error rate and an 11 percent reduction in character error rate when evaluated in the presence of environmental noise. These findings highlight the enhanced performance of the proposed model in real-world, noisy conditions.