TR2019-098

Unidirectional Neural Network Architectures for End-to-End Automatic Speech Recognition



In hybrid automatic speech recognition (ASR) systems, neural networks are used as acoustic models (AMs) to recognize phonemes that are composed to words and sentences using pronunciation dictionaries, hidden Markov models, and language models, which can be jointly represented by a weighted finite state transducer (WFST). The importance of capturing temporal context by an AM has been studied and discussed in prior work. In an end-to-end ASR system, however, all components are merged into a single neural network, i.e., the breakdown into an AM and the different parts of the WFST model is no longer possible. This implies that end-to-end neural network architectures have even stronger requirements for processing long contextual information. Bidirectional long short-term memory (BLSTM) neural networks have demonstrated state-of-the-art results in end-to-end ASR but are unsuitable for streaming applications. Latency-controlled BLSTMs account for this by limiting the future context seen by the backward directed recurrence using chunk-wise processing. In this paper, we propose two new unidirectional neural network architectures, the timedelay LSTM (TDLSTM) and the parallel time-delayed LSTM (PTDLSTM) streams, which both limit the processing latency to a fixed size and demonstrate significant improvements compared to prior art on a variety of ASR tasks.