Seamless Speech Recognition

A new multilingual speech recognition technology that simultaneously identifies the language spoken and recognizes the words.

Takaaki Hori, Jonathan Le Roux, Bret Harsham, Niko Moritz, Gordon Wichern
Joint work with: Hiroshi Seki (Toyohashi University of Technology)



This research tackles the challenging task of multilingual multi-speaker automatic speech recognition (ASR) using an all-in-one end-to-end system. Several multilingual ASR systems based on a single monolithic neural network without language-dependent modules have recently been proposed, showing that modeling multiple languages is well within the capabilities of an end-to-end framework. There has also been growing interest in multi-speaker speech recognition, which generates multiple label sequences from single-channel mixed speech. In particular, a multi-speaker end-to-end ASR system that directly models this one-to-many mapping without auxiliary cues was recently proposed.
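Generating several transcripts from one mixture raises a labeling problem: the order of the reference transcripts is arbitrary, so the system must decide which output stream corresponds to which speaker. A common way to handle this, known as permutation-invariant (or permutation-free) training, scores every assignment of output streams to references and keeps the cheapest one. The minimal Python sketch below illustrates only that selection step, under stated assumptions: it uses token-level edit distance as a stand-in for the differentiable loss (such as cross-entropy) a real training setup would use, and the function names and example transcripts are hypothetical.

```python
from itertools import permutations
from typing import Sequence, Tuple


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Token-level Levenshtein distance (a stand-in for a training loss)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            # deletion, insertion, substitution (free if tokens match)
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def pit_min_loss(pair_loss: Sequence[Sequence[float]]) -> Tuple[float, Tuple[int, ...]]:
    """Return the lowest total loss over all output-to-reference assignments.

    pair_loss[i][j] is the loss of output stream i scored against reference j;
    the returned permutation maps output i to reference perm[i].
    """
    n = len(pair_loss)
    best_loss = float("inf")
    best_perm: Tuple[int, ...] = tuple(range(n))
    for perm in permutations(range(n)):
        total = sum(pair_loss[i][perm[i]] for i in range(n))
        if total < best_loss:
            best_loss, best_perm = total, perm
    return best_loss, best_perm


if __name__ == "__main__":
    # Hypothetical decoded streams from a two-speaker mixture, in arbitrary order.
    hyps = [["where", "is", "gate", "5"], ["toire", "wa", "doko"]]
    refs = [["toire", "wa", "doko"], ["where", "is", "gate", "5"]]
    pair = [[edit_distance(h, r) for r in refs] for h in hyps]
    print(pit_min_loss(pair))  # (0, (1, 0))
```

Enumerating all permutations costs n! evaluations, which is cheap for the two or three overlapping speakers typical of this setting; with many speakers, the Hungarian algorithm solves the same assignment problem in polynomial time.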

We propose an all-in-one end-to-end multilingual multi-speaker ASR system, the first of its kind, that integrates the end-to-end approaches for multilingual ASR and multi-speaker ASR. This system can provide a seamless ASR experience, in particular improving the accessibility of interfaces that serve a diverse set of users.
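The system described above identifies the spoken language and recognizes the words at the same time, without language-dependent modules. One approach used in end-to-end multilingual ASR is to add language-identity symbols to the output vocabulary, so that the decoder emits a tag such as `[EN]` before the words of each language and a code-switched utterance becomes a single token stream. The Python sketch below is an illustration of that idea, not necessarily how this system is implemented: it splits such a stream into per-language segments, for example so that each part of a query can be answered in the matching language. The tag strings, the `LANG_TAGS` table, and `split_by_language` are hypothetical, and tokens are treated as space-separated words for simplicity.

```python
from typing import List, Tuple

# Hypothetical language-tag tokens; a real system defines its own inventory.
LANG_TAGS = {"[EN]": "en", "[JA]": "ja", "[ZH]": "zh"}


def split_by_language(tokens: List[str]) -> List[Tuple[str, str]]:
    """Group a decoded token stream into (language, text) segments.

    Assumes a language tag precedes each run of tokens in that language;
    tokens appearing before any tag are labeled "unk".
    """
    segments: List[Tuple[str, List[str]]] = []
    for tok in tokens:
        if tok in LANG_TAGS:
            # A tag opens a new segment in that language.
            segments.append((LANG_TAGS[tok], []))
        else:
            if not segments:
                segments.append(("unk", []))
            segments[-1][1].append(tok)
    # Drop segments left empty by consecutive tags.
    return [(lang, " ".join(words)) for lang, words in segments if words]


if __name__ == "__main__":
    stream = ["[EN]", "where", "is", "the", "gate", "[JA]", "arigatou"]
    print(split_by_language(stream))
```

Because the language decision is just another output token, the same decoder that transcribes the words also tracks language switches, which is what lets a single network handle code-switched input.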

As an example of a potential application, we developed a live demonstration of a multilingual guidance system for an airport. The system provides a speech interface that can guide users to various locations within an airport. It can recognize multilingual speech with code-switching, as well as simultaneous speech by multiple speakers in various languages, without any prior language setting, and it provides the appropriate guidance for each query in the corresponding language. The demonstration was presented at a press event in Tokyo, Japan, in February 2019. It was widely covered by the Japanese media, with reports by all six main Japanese TV stations and multiple articles in print and online newspapers, including Japan's top newspaper, Asahi Shimbun.
