TR2017-190

Hybrid CTC/Attention Architecture for End-to-End Speech Recognition

- Watanabe, S., Hori, T., Kim, S., Hershey, J.R., Hayashi, T., "Hybrid CTC/Attention Architecture for End-to-End Speech Recognition", IEEE Journal of Selected Topics in Signal Processing, DOI: 10.1109/JSTSP.2017.2763455, Vol. 11, No. 8, pp. 1240-1253, October 2017.
  @article{Watanabe2017oct,
    author = {Watanabe, Shinji and Hori, Takaaki and Kim, Suyoun and Hershey, John R. and Hayashi, Tomoki},
    title = {Hybrid CTC/Attention Architecture for End-to-End Speech Recognition},
    journal = {IEEE Journal of Selected Topics in Signal Processing},
    year = 2017,
    volume = 11,
    number = 8,
    pages = {1240--1253},
    month = oct,
    doi = {10.1109/JSTSP.2017.2763455},
    issn = {1941-0484},
    url = {https://www.merl.com/publications/TR2017-190}
  }
Research Areas:

Artificial Intelligence, Speech & Audio

Abstract:

Conventional automatic speech recognition (ASR) based on a hidden Markov model (HMM)/deep neural network (DNN) is a very complicated system consisting of various modules such as acoustic, lexicon, and language models. It also requires linguistic resources such as a pronunciation dictionary, tokenization, and phonetic context-dependency trees. On the other hand, end-to-end ASR has become a popular alternative to greatly simplify the model-building process of conventional ASR systems by representing complicated modules with a single deep network architecture, and by replacing the use of linguistic resources with a data-driven learning method. There are two major types of end-to-end architectures for ASR: attention-based methods use an attention mechanism to perform alignment between acoustic frames and recognized symbols, and connectionist temporal classification (CTC) uses Markov assumptions to efficiently solve sequential problems by dynamic programming. This paper proposes hybrid CTC/attention end-to-end ASR, which effectively utilizes the advantages of both architectures in training and decoding. During training, we employ the multiobjective learning framework to improve robustness and achieve fast convergence. During decoding, we perform joint decoding by combining both attention-based and CTC scores in a one-pass beam search algorithm to further eliminate irregular alignments. Experiments with English (WSJ and CHiME-4) tasks demonstrate the effectiveness of the proposed multiobjective learning over both the CTC and attention-based encoder-decoder baselines. Moreover, the proposed method is applied to two large-scale ASR benchmarks (spontaneous Japanese and Mandarin Chinese), and exhibits performance that is comparable to conventional DNN/HMM ASR systems based on the advantages of both multiobjective learning and joint decoding without linguistic resources.
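
The two combinations described in the abstract can be illustrated with a minimal Python sketch. It assumes that both the multiobjective training loss and the joint decoding score are linear interpolations controlled by a single weight; the function names, the default weight of 0.3, and the example scores are hypothetical and chosen only for illustration. For brevity the sketch picks among complete hypotheses, whereas the one-pass beam search in the abstract combines the scores during the search itself.

```python
from typing import List, Tuple


def multiobjective_loss(ctc_loss: float, attention_loss: float,
                        ctc_weight: float = 0.3) -> float:
    """Interpolate the CTC and attention training losses.

    ctc_weight is an assumed tunable value in [0, 1]; 0.3 is illustrative.
    """
    if not 0.0 <= ctc_weight <= 1.0:
        raise ValueError("ctc_weight must lie in [0, 1]")
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * attention_loss


def joint_score(ctc_logp: float, att_logp: float,
                ctc_weight: float = 0.3) -> float:
    """Combine the log-probabilities that each model assigns to a hypothesis."""
    return ctc_weight * ctc_logp + (1.0 - ctc_weight) * att_logp


def pick_best(hypotheses: List[Tuple[str, float, float]],
              ctc_weight: float = 0.3) -> str:
    """Return the text of the hypothesis with the highest joint score.

    Each hypothesis is (text, ctc_logp, att_logp). Simplification: this
    scores complete hypotheses only, not partial ones inside a beam search.
    """
    best = max(hypotheses, key=lambda h: joint_score(h[1], h[2], ctc_weight))
    return best[0]


# Made-up scores: the attention model slightly prefers a looping hypothesis,
# while CTC assigns it a very low probability.
hyps = [("a b", -2.0, -1.0), ("a a a a", -50.0, -0.5)]
attention_only = pick_best(hyps, ctc_weight=0.0)  # selects "a a a a"
joint = pick_best(hyps, ctc_weight=0.3)           # selects "a b"
```

With the attention score alone, the looping hypothesis wins; adding the CTC term penalizes it heavily, which is the intuition behind using CTC scores to suppress irregular alignments.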

Related News & Events

NEWS MERL's seamless speech recognition technology featured in Mitsubishi Electric Corporation press release
Date: February 13, 2019
Where: Tokyo, Japan
MERL Contacts: Jonathan Le Roux; Gordon Wichern
Research Area: Speech & Audio
Brief
- Mitsubishi Electric Corporation announced that it has developed the world's first technology capable of highly accurate multilingual speech recognition without being informed which language is being spoken. The novel technology, Seamless Speech Recognition, incorporates Mitsubishi Electric's proprietary Maisart compact AI technology and is built on a single system that can simultaneously identify and understand spoken languages. In tests involving 5 languages, the system achieved recognition with over 90 percent accuracy, without being informed which language was being spoken. When incorporating 5 more languages with lower resources, accuracy remained above 80 percent. The technology can also understand multiple people speaking either the same or different languages simultaneously. A live demonstration involving a multilingual airport guidance system took place on February 13 in Tokyo, Japan. It was widely covered by the Japanese media, with reports by all six main Japanese TV stations and multiple articles in print and online newspapers, including in Japan's top newspaper, Asahi Shimbun. The technology is based on recent research by MERL's Speech and Audio team.
  
  Link:
  
  Mitsubishi Electric Corporation Press Release
  
  Media Coverage:
  
  NHK, News (Japanese)
  NHK World, News (English), video report (starting at 4'38")
  TV Asahi, ANN news (Japanese)
  Nippon TV, News24 (Japanese)
  Fuji TV, Prime News Alpha (Japanese)
  TV Tokyo, World Business Satellite (Japanese)
  TV Tokyo, Morning Satellite (Japanese)
  TBS, News, N Studio (Japanese)
  The Asahi Shimbun (Japanese)
  The Nikkei Shimbun (Japanese)
  Nikkei xTech (Japanese)
  Response (Japanese).

Related Research Highlights

Seamless Speech Recognition
