TR2018-104

A Purely End-to-end System for Multi-speaker Speech Recognition

- Seki, H., Hori, T., Watanabe, S., Le Roux, J., Hershey, J., "A Purely End-to-end System for Multi-speaker Speech Recognition", Annual Meeting of the Association for Computational Linguistics (ACL), July 2018, pp. 2620-2630.
  @inproceedings{Seki2018jul,
    author = {Seki, Hiroshi and Hori, Takaaki and Watanabe, Shinji and Le Roux, Jonathan and Hershey, John},
    title = {A Purely End-to-end System for Multi-speaker Speech Recognition},
    booktitle = {Annual Meeting of the Association for Computational Linguistics (ACL)},
    year = 2018,
    pages = {2620--2630},
    month = jul,
    publisher = {Association for Computational Linguistics},
    url = {https://www.merl.com/publications/TR2018-104}
  }
MERL Contact:
- Jonathan Le Roux
Research Areas:

Artificial Intelligence, Speech & Audio

Abstract:

Recently, there has been growing interest in multi-speaker speech recognition, where the utterances of multiple speakers are recognized from their mixture. Promising techniques have been proposed for this task, but earlier works have required additional training data such as isolated source signals or senone alignments for effective learning. In this paper, we propose a new sequence-to-sequence framework to directly decode multiple label sequences from a single speech sequence by unifying source separation and speech recognition functions in an end-to-end manner. We further propose a new objective function to improve the contrast between the hidden vectors to avoid generating similar hypotheses. Experimental results show that the model is directly able to learn a mapping from a speech mixture to multiple label sequences, achieving 83.1% relative improvement compared to a model trained without the proposed objective. Interestingly, the results are comparable to those produced by previous end-to-end works featuring explicit separation and recognition modules.
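
As a rough illustration of what "improving the contrast between the hidden vectors" could mean in practice, the sketch below computes a penalty that becomes more negative as two per-speaker hidden sequences diverge. The choice of a negative symmetric KL divergence over softmax-normalized vectors is an assumption made for this example, and the function names (`softmax`, `kl_divergence`, `contrast_penalty`) and the toy vectors are hypothetical. The abstract itself does not specify the form of the objective.

```python
import math


def softmax(vec):
    """Normalize a vector of real scores into a probability distribution."""
    m = max(vec)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in vec]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def contrast_penalty(hidden_a, hidden_b):
    """Negative symmetric KL, summed over frames.

    Assumed form for illustration only: minimizing this value pushes the
    two hidden sequences apart, discouraging near-identical hypotheses.
    """
    total = 0.0
    for ha, hb in zip(hidden_a, hidden_b):
        p, q = softmax(ha), softmax(hb)
        total += kl_divergence(p, q) + kl_divergence(q, p)
    return -total


# Hypothetical one-frame hidden sequences for two output branches.
similar = contrast_penalty([[1.0, 2.0, 3.0]], [[1.0, 2.0, 3.0]])
distinct = contrast_penalty([[1.0, 2.0, 3.0]], [[3.0, 2.0, 1.0]])
```

In this toy setting, identical hidden vectors give a penalty of zero while dissimilar ones give a negative value, so adding such a term to a recognition loss would reward branches that encode different speakers.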

Related News & Events

NEWS MERL's seamless speech recognition technology featured in Mitsubishi Electric Corporation press release
Date: February 13, 2019
Where: Tokyo, Japan
MERL Contacts: Jonathan Le Roux; Gordon Wichern
Research Area: Speech & Audio
Brief
- Mitsubishi Electric Corporation announced that it has developed the world's first technology capable of highly accurate multilingual speech recognition without being informed which language is being spoken. The novel technology, Seamless Speech Recognition, incorporates Mitsubishi Electric's proprietary Maisart compact AI technology and is built on a single system that can simultaneously identify and understand spoken languages. In tests involving 5 languages, the system achieved recognition with over 90 percent accuracy, without being informed which language was being spoken. When incorporating 5 more languages with lower resources, accuracy remained above 80 percent. The technology can also understand multiple people speaking either the same or different languages simultaneously. A live demonstration involving a multilingual airport guidance system took place on February 13 in Tokyo, Japan. It was widely covered by the Japanese media, with reports by all six main Japanese TV stations and multiple articles in print and online newspapers, including in Japan's top newspaper, Asahi Shimbun. The technology is based on recent research by MERL's Speech and Audio team.
  
  Link:
  
  Mitsubishi Electric Corporation Press Release
  
  Media Coverage:
  
  NHK, News (Japanese)
  NHK World, News (English), video report (starting at 4'38")
  TV Asahi, ANN news (Japanese)
  Nippon TV, News24 (Japanese)
  Fuji TV, Prime News Alpha (Japanese)
  TV Tokyo, World Business Satellite (Japanese)
  TV Tokyo, Morning Satellite (Japanese)
  TBS, News, N Studio (Japanese)
  The Asahi Shimbun (Japanese)
  The Nikkei Shimbun (Japanese)
  Nikkei xTech (Japanese)
  Response (Japanese).

Related Research Highlights

Seamless Speech Recognition

Related Publication

Seki, H., Hori, T., Watanabe, S., Le Roux, J., Hershey, J., "A Purely End-to-end System for Multi-speaker Speech Recognition", arXiv, July 2018.


@article{Seki2018jul2,
author = {Seki, Hiroshi and Hori, Takaaki and Watanabe, Shinji and Le Roux, Jonathan and Hershey, John},
title = {A Purely End-to-end System for Multi-speaker Speech Recognition},
journal = {arXiv},
year = 2018,
month = jul,
url = {https://arxiv.org/abs/1805.05826}
}
