TR2018-001

End-to-End Multi-Speaker Speech Recognition

- Settle, S., Le Roux, J., Hori, T., Watanabe, S., Hershey, J.R., "End-to-End Multi-Speaker Speech Recognition", IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), DOI: 10.1109/ICASSP.2018.8461893, April 2018, pp. 4819-4823.
  @inproceedings{Settle2018apr,
    author = {Settle, Shane and {Le Roux}, Jonathan and Hori, Takaaki and Watanabe, Shinji and Hershey, John R.},
    title = {{End-to-End Multi-Speaker Speech Recognition}},
    booktitle = {IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)},
    year = 2018,
    pages = {4819--4823},
    month = apr,
    doi = {10.1109/ICASSP.2018.8461893},
    url = {https://www.merl.com/publications/TR2018-001}
  }
MERL Contact: Jonathan Le Roux
Research Areas:

Artificial Intelligence, Speech & Audio

Abstract:

Current advances in deep learning have resulted in a convergence of methods across a wide range of tasks, opening the door for tighter integration of modules that were previously developed and optimized in isolation. Recent ground-breaking works have produced end-to-end deep network methods for both speech separation and end-to-end automatic speech recognition (ASR). Speech separation methods such as deep clustering address the challenging cocktail-party problem of distinguishing multiple simultaneous speech signals. This is an enabling technology for real-world human-machine interaction (HMI). However, speech separation requires ASR to interpret the speech for any HMI task. Likewise, ASR requires speech separation to work in an unconstrained environment. Although these two components can be trained in isolation and connected after the fact, this paradigm is likely to be sub-optimal, since it relies on artificially mixed data. In this paper, we develop the first fully end-to-end, jointly trained deep learning system for separation and recognition of overlapping speech signals. The joint training framework synergistically adapts the separation and recognition to each other. As an additional benefit, it enables training on more realistic data that contains only mixed signals and their transcriptions, and thus is suited to large-scale training on existing transcribed data.

Related News & Events

NEWS MERL's seamless speech recognition technology featured in Mitsubishi Electric Corporation press release
Date: February 13, 2019
Where: Tokyo, Japan
MERL Contacts: Jonathan Le Roux; Gordon Wichern
Research Area: Speech & Audio
Brief
- Mitsubishi Electric Corporation announced that it has developed the world's first technology capable of highly accurate multilingual speech recognition without being informed which language is being spoken. The novel technology, Seamless Speech Recognition, incorporates Mitsubishi Electric's proprietary Maisart compact AI technology and is built on a single system that can simultaneously identify and understand spoken languages. In tests involving 5 languages, the system achieved recognition with over 90 percent accuracy, without being informed which language was being spoken. When incorporating 5 more languages with lower resources, accuracy remained above 80 percent. The technology can also understand multiple people speaking either the same or different languages simultaneously. A live demonstration involving a multilingual airport guidance system took place on February 13 in Tokyo, Japan. It was widely covered by the Japanese media, with reports by all six main Japanese TV stations and multiple articles in print and online newspapers, including in Japan's top newspaper, Asahi Shimbun. The technology is based on recent research by MERL's Speech and Audio team.
  
  Link:
  
  Mitsubishi Electric Corporation Press Release
  
  Media Coverage:
  
  NHK, News (Japanese)
  NHK World, News (English), video report (starting at 4'38")
  TV Asahi, ANN news (Japanese)
  Nippon TV, News24 (Japanese)
  Fuji TV, Prime News Alpha (Japanese)
  TV Tokyo, World Business Satellite (Japanese)
  TV Tokyo, Morning Satellite (Japanese)
  TBS, News, N Studio (Japanese)
  The Asahi Shimbun (Japanese)
  The Nikkei Shimbun (Japanese)
  Nikkei xTech (Japanese)
  Response (Japanese)
NEWS MERL presenting 9 papers at ICASSP 2018
Date: April 15, 2018 - April 20, 2018
Where: Calgary, AB
MERL Contacts: Petros T. Boufounos; Toshiaki Koike-Akino; Jonathan Le Roux; Dehong Liu; Hassan Mansour; Philip V. Orlik; Pu (Perry) Wang
Research Areas: Computational Sensing, Digital Video, Speech & Audio
Brief
- MERL researchers are presenting 9 papers at the IEEE International Conference on Acoustics, Speech & Signal Processing (ICASSP), which is being held in Calgary from April 15-20, 2018. Topics to be presented include recent advances in speech recognition, audio processing, and computational sensing. MERL is also a sponsor of the conference.
  
  ICASSP is the flagship conference of the IEEE Signal Processing Society, and the world's largest and most comprehensive technical conference focused on the research advances and latest technological development in signal and information processing. The event attracts more than 2000 participants each year.

Related Research Highlights

Seamless Speech Recognition
