Triggered Attention for End-to-End Speech Recognition

A new system architecture for end-to-end automatic speech recognition (ASR) is proposed that combines the alignment capabilities of the connectionist temporal classification (CTC) approach with the modeling strength of the attention mechanism. The proposed architecture, named triggered attention (TA), uses a CTC-based classifier to control the activation of an attention-based decoder neural network. This allows for a frame-synchronous decoding scheme with an adjustable look-ahead parameter that controls the induced delay, and it opens the door to streaming recognition with attention-based end-to-end ASR systems. We present ASR results of the TA model on three data sets of different sizes and languages and compare the scores to a well-tuned attention-based end-to-end ASR baseline system, which consumes input frames in the traditional full-sequence manner. The TA decoder achieves similar or better ASR results than the full-sequence attention model in all experiments, while limiting the decoding delay to two look-ahead frames, which in our setup corresponds to an output delay of 80 ms.
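
The triggering idea can be illustrated with a minimal sketch. It is not the implementation evaluated in the paper. It assumes greedy (per-frame argmax) CTC labels, a blank index of 0, and a decoder that may attend to all encoder frames up to the trigger frame plus the look-ahead. The function names are hypothetical, and the 40 ms frame duration is inferred from the stated 80 ms delay for two look-ahead frames.

```python
from typing import List

BLANK = 0  # assumed index of the CTC blank symbol


def ctc_trigger_frames(frame_labels: List[int], blank: int = BLANK) -> List[int]:
    """Return the frames at which greedy CTC output starts a new non-blank label.

    A frame is a trigger if its label is not blank and differs from the label
    of the previous frame (repeated labels are collapsed, as in CTC decoding).
    """
    triggers = []
    prev = blank
    for t, label in enumerate(frame_labels):
        if label != blank and label != prev:
            triggers.append(t)
        prev = label
    return triggers


def attention_window_ends(triggers: List[int], num_frames: int, look_ahead: int) -> List[int]:
    """For each trigger, return the exclusive end index of the visible encoder frames.

    The decoder step fired by a trigger at frame t is assumed to see frames
    [0, t + look_ahead], clipped to the length of the utterance.
    """
    return [min(t + look_ahead + 1, num_frames) for t in triggers]


def look_ahead_delay_ms(look_ahead: int, frame_ms: float = 40.0) -> float:
    """Delay induced by the look-ahead, assuming frame_ms per encoder frame."""
    return look_ahead * frame_ms


# Toy per-frame CTC argmax output for a 10-frame utterance (0 = blank).
labels = [0, 0, 3, 3, 0, 5, 0, 0, 5, 0]
triggers = ctc_trigger_frames(labels)
windows = attention_window_ends(triggers, num_frames=len(labels), look_ahead=2)
print(triggers)                 # [2, 5, 8]
print(windows)                  # [5, 8, 10]
print(look_ahead_delay_ms(2))   # 80.0
```

In this toy example each trigger fires one decoder step, and the step only waits for two further encoder frames instead of the end of the utterance, which is what makes the scheme frame-synchronous.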