Abstract
Automatic Speech Recognition (ASR) is one of the main research topics in speech processing; its goal is to convert a speech signal into a sequence of words. Speech recognition systems comprise components such as audio preprocessing, an acoustic model, a lexicon, a language model, and a decoder. In recent years there has been a strong trend toward so-called End-to-End (E2E) systems. An E2E ASR system converts a feature sequence x into a sequence of output symbol probabilities (over phonemes, characters, or words). One family of E2E methods trains a neural network with the Connectionist Temporal Classification (CTC) loss function. Once the network is trained, the decoder must label an unknown input sequence x by choosing the most probable labeling l*. For decoding CTC networks, best-path decoding and Prefix Beam Search (PBS) serve as baselines; in PBS, a language model is used to reduce the error.
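The best-path baseline mentioned above can be sketched in a few lines: at each time step take the most probable symbol, then collapse consecutive repeats and remove CTC blanks. This is a minimal illustrative sketch, not the thesis implementation; the toy posteriors and the blank index are assumptions.

```python
import numpy as np

def best_path_decode(probs, blank=0):
    """Greedy CTC decoding: pick the most probable symbol per frame,
    then collapse consecutive repeats and drop blank symbols."""
    path = np.argmax(probs, axis=1)   # best symbol at each time step
    decoded = []
    prev = None
    for s in path:
        if s != prev and s != blank:
            decoded.append(int(s))
        prev = s
    return decoded

# Toy frame-level posteriors over {blank=0, 'a'=1, 'b'=2} for 5 frames.
probs = np.array([
    [0.1, 0.8, 0.1],   # 'a'
    [0.1, 0.8, 0.1],   # 'a' (repeat, collapsed)
    [0.8, 0.1, 0.1],   # blank
    [0.1, 0.1, 0.8],   # 'b'
    [0.8, 0.1, 0.1],   # blank
])
print(best_path_decode(probs))  # [1, 2], i.e. "ab"
```

Best-path decoding is fast but ignores how probability mass is spread over alternative alignments, which is why prefix beam search with a language model gives lower error rates.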
In this thesis, we corrected the PBS decoding algorithm and improved it by applying a penalty to unknown, or Out-Of-Vocabulary (OOV), words in the language model. Words misrecognized by the speech recognizer are often OOV, so penalizing the probability of these words lowers the probability of wrong guesses. To implement this idea, we first correct the PBS algorithm; we then apply a penalty to the OOV probability during decoding, and finally tune the penalty using the entropy of the CTC output. The proposed method
improves the decoder's output and reduces the error compared to the baseline PBS algorithm. On the LibriSpeech dataset, the Word Error Rate (WER) is reduced from 10.079 to 9.440, and on the TED-LIUM dataset from 28.894 to 28.172. The Character Error Rate (CER) is reduced from 2.735 to 2.669 on LibriSpeech and from 8.980 to 8.801 on TED-LIUM.
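The OOV-penalty idea described above can be sketched as a small change to the language-model scoring step inside prefix beam search: when a candidate word is outside the vocabulary, its probability is scaled down before it contributes to the beam score. The function and variable names (`lm_score`, `vocab`, `lm_prob`), the back-off value, and the penalty factor are illustrative assumptions, not the thesis implementation.

```python
def lm_score(word, vocab, lm_prob, oov_penalty=0.01):
    """Return the language-model probability of `word`,
    scaled down by `oov_penalty` when the word is OOV."""
    p = lm_prob.get(word, 1e-6)   # assumed back-off probability for unseen words
    if word not in vocab:
        p *= oov_penalty          # penalize out-of-vocabulary guesses
    return p

vocab = {"the", "cat", "sat"}
lm_prob = {"the": 0.2, "cat": 0.05, "sat": 0.04}
print(lm_score("cat", vocab, lm_prob))  # 0.05 (in vocabulary, unpenalized)
print(lm_score("kat", vocab, lm_prob))  # ~1e-08 (OOV: 1e-6 * 0.01)
```

In the thesis the penalty is further tuned per utterance using the entropy of the CTC output, so that a fixed `oov_penalty` as above would be replaced by an entropy-dependent value.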