End-to-end ASR without using a morphological analyzer, pronunciation dictionary, or language model

This paper introduces a Japanese end-to-end ASR system based on a joint CTC/attention scheme [1], which extends attention-based ASR [2] by using multi-task learning to incorporate the Connectionist Temporal Classification (CTC) objective. Unlike conventional Japanese ASR systems based on DNN/HMM hybrids [3] or end-to-end systems that use Japanese syllable characters (i.e., hiragana or katakana) [4], this method directly predicts a Japanese sentence using a standard Japanese character set that includes kanji, hiragana, and katakana characters, Roman/Greek alphabets, Arabic numerals, and so on. Thus, the method does not use a pronunciation dictionary, which must be hand-crafted. In addition, because recognition is character-based, it does not require a morphological analyzer to chunk a character sequence into a word sequence. Finally, the attention mechanism itself provides a language-model-like function in the decoder network, unlike a Japanese end-to-end system based on CTC [5]. Therefore, it does not require a separate language model module, which makes system construction and the decoding process very simple.