TR2017-132

Advances in Joint CTC-Attention based End-to-End Speech Recognition with a Deep CNN Encoder and RNN-LM


    •  Hori, T., Watanabe, S., Zhang, Y., Chan, W., "Advances in Joint CTC-Attention based End-to-End Speech Recognition with a Deep CNN Encoder and RNN-LM", Interspeech, August 2017.
      @inproceedings{Hori2017aug,
        author = {Hori, Takaaki and Watanabe, Shinji and Zhang, Yu and Chan, William},
        title = {Advances in Joint CTC-Attention based End-to-End Speech Recognition with a Deep CNN Encoder and RNN-LM},
        booktitle = {Interspeech},
        year = 2017,
        month = aug,
        url = {https://www.merl.com/publications/TR2017-132}
      }
  • Research Areas: Artificial Intelligence, Speech & Audio

  • Research Highlights: Seamless Speech Recognition

We present a state-of-the-art end-to-end Automatic Speech Recognition (ASR) model. We learn to listen and write characters with a joint Connectionist Temporal Classification (CTC) and attention-based encoder-decoder network. The encoder is a deep Convolutional Neural Network (CNN) based on the VGG network. The CTC network sits on top of the encoder and is jointly trained with the attention-based decoder. During beam search, we combine the CTC predictions, the attention-based decoder predictions, and a separately trained LSTM language model (RNN-LM). We achieve a 5-10% error reduction compared to prior systems on spontaneous Japanese and Chinese speech, and our end-to-end model outperforms traditional hybrid ASR systems.
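The decoding step above interpolates three log-probability scores per hypothesis: the CTC score, the attention-decoder score, and the language-model score. The sketch below illustrates that scoring rule in plain Python; the weight names (`lam` for the CTC/attention interpolation, `gamma` for the LM weight) and the numeric values are illustrative assumptions, not the settings reported in the paper.

```python
def joint_score(logp_ctc, logp_att, logp_lm, lam=0.3, gamma=0.5):
    """Combine CTC, attention-decoder, and RNN-LM log-probabilities
    for one beam-search hypothesis.

    lam interpolates between the CTC and attention scores;
    gamma weights the separately trained language model.
    (Both weights are illustrative, not values from the paper.)
    """
    return lam * logp_ctc + (1.0 - lam) * logp_att + gamma * logp_lm

# Toy example: rank two candidate hypotheses by their joint score.
candidates = {
    "hyp_a": joint_score(-4.0, -3.0, -2.5),
    "hyp_b": joint_score(-5.0, -2.0, -3.0),
}
best = max(candidates, key=candidates.get)
```

In a real beam search, this score would be computed for each partial hypothesis at every decoding step and used to keep the top-scoring beams; the one-shot comparison here only shows how the three model scores are fused.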