TR2017-016

Joint CTC-Attention Based End-to-End Speech Recognition Using Multi-task Learning

- Kim, S., Hori, T., Watanabe, S., "Joint CTC-Attention Based End-to-End Speech Recognition Using Multi-task Learning", IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), March 2017.
  @inproceedings{Kim2017mar,
    author = {Kim, Suyoun and Hori, Takaaki and Watanabe, Shinji},
    title = {{Joint CTC-Attention Based End-to-End Speech Recognition Using Multi-task Learning}},
    booktitle = {IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)},
    year = 2017,
    month = mar,
    url = {https://www.merl.com/publications/TR2017-016}
  }
Research Areas:

Artificial Intelligence, Speech & Audio

Abstract:

Recently, there has been increasing interest in end-to-end speech recognition that directly transcribes speech to text without any predefined alignments. One approach is the attention-based encoder-decoder framework, which learns a mapping between variable-length input and output sequences in one step using a purely data-driven method. The attention model has often been shown to outperform another end-to-end approach, Connectionist Temporal Classification (CTC), mainly because it explicitly uses the history of the target characters without any conditional independence assumptions. However, we observed that, compared with CTC, the attention model degrades severely in noisy conditions and is hard to train in the initial stage with long input sequences. This is because the attention model is too flexible to predict proper alignments in such cases, as it lacks the left-to-right constraints used in CTC. This paper presents a novel method for end-to-end speech recognition that improves robustness and achieves fast convergence by using a joint CTC-attention model within the multi-task learning framework, thereby mitigating the alignment issue. An experiment on the WSJ and CHiME-4 tasks demonstrates its advantages over both the CTC and attention-based encoder-decoder baselines, showing 5.4-14.6% relative improvements in Character Error Rate (CER).
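
The abstract does not spell out the training objective or how the relative improvement is computed. As a rough illustration only, the sketch below assumes that the multi-task objective is a weighted interpolation of the CTC loss and the attention loss, and that relative improvement is the reduction in CER divided by the baseline CER. The function names and the `ctc_weight` parameter are hypothetical and are not taken from this page.

```python
def joint_ctc_attention_loss(ctc_loss: float, attention_loss: float,
                             ctc_weight: float) -> float:
    """Interpolate two losses (assumed form of the multi-task objective).

    ctc_weight = 1.0 reduces to pure CTC training,
    ctc_weight = 0.0 reduces to pure attention training.
    """
    if not 0.0 <= ctc_weight <= 1.0:
        raise ValueError("ctc_weight must lie in [0, 1]")
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * attention_loss


def relative_improvement(baseline_cer: float, new_cer: float) -> float:
    """Relative CER reduction in percent, measured against the baseline."""
    if baseline_cer <= 0.0:
        raise ValueError("baseline_cer must be positive")
    return 100.0 * (baseline_cer - new_cer) / baseline_cer


if __name__ == "__main__":
    # Illustrative numbers only; these are not results from the paper.
    print(joint_ctc_attention_loss(2.0, 4.0, ctc_weight=0.5))
    print(relative_improvement(baseline_cer=10.0, new_cer=9.0))
```

In a real system the two loss terms would come from a shared encoder feeding a CTC output layer and an attention decoder; here they are plain numbers so that the interpolation itself is easy to see.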

Related News & Events

NEWS MERL's seamless speech recognition technology featured in Mitsubishi Electric Corporation press release
Date: February 13, 2019
Where: Tokyo, Japan
MERL Contacts: Jonathan Le Roux; Gordon Wichern
Research Area: Speech & Audio
Brief
- Mitsubishi Electric Corporation announced that it has developed the world's first technology capable of highly accurate multilingual speech recognition without being informed which language is being spoken. The novel technology, Seamless Speech Recognition, incorporates Mitsubishi Electric's proprietary Maisart compact AI technology and is built on a single system that can simultaneously identify and understand spoken languages. In tests involving 5 languages, the system achieved recognition with over 90 percent accuracy, without being informed which language was being spoken. When incorporating 5 more languages with lower resources, accuracy remained above 80 percent. The technology can also understand multiple people speaking either the same or different languages simultaneously. A live demonstration involving a multilingual airport guidance system took place on February 13 in Tokyo, Japan. It was widely covered by the Japanese media, with reports by all six main Japanese TV stations and multiple articles in print and online newspapers, including in Japan's top newspaper, Asahi Shimbun. The technology is based on recent research by MERL's Speech and Audio team.
  
  Link:
  
  Mitsubishi Electric Corporation Press Release
  
  Media Coverage:
  
  NHK, News (Japanese)
  NHK World, News (English), video report (starting at 4'38")
  TV Asahi, ANN news (Japanese)
  Nippon TV, News24 (Japanese)
  Fuji TV, Prime News Alpha (Japanese)
  TV Tokyo, World Business Satellite (Japanese)
  TV Tokyo, Morning Satellite (Japanese)
  TBS, News, N Studio (Japanese)
  The Asahi Shimbun (Japanese)
  The Nikkei Shimbun (Japanese)
  Nikkei xTech (Japanese)
  Response (Japanese).
NEWS MERL to present 10 papers at ICASSP 2017
Date: March 5, 2017 - March 9, 2017
Where: New Orleans
MERL Contacts: Petros T. Boufounos; Jonathan Le Roux; Dehong Liu; Hassan Mansour; Anthony Vetro; Ye Wang
Research Areas: Computer Vision, Computational Sensing, Digital Video, Information Security, Speech & Audio
Brief
- MERL researchers will present 10 papers at the upcoming IEEE International Conference on Acoustics, Speech & Signal Processing (ICASSP), to be held in New Orleans from March 5-9, 2017. Topics to be presented include recent advances in speech recognition and audio processing; graph signal processing; computational imaging; and privacy-preserving data analysis.
  
  ICASSP is the flagship conference of the IEEE Signal Processing Society, and the world's largest and most comprehensive technical conference focused on the research advances and latest technological development in signal and information processing. The event attracts more than 2000 participants each year.

Related Research Highlights

Seamless Speech Recognition
