End-to-End Multilingual Multi-Speaker Speech Recognition

    •  Seki, H., Hori, T., Watanabe, S., Le Roux, J., Hershey, J., "End-to-End Multilingual Multi-Speaker Speech Recognition," Interspeech, DOI: 10.21437/Interspeech.2019-3038, September 2019, pp. 3755-3759. (MERL Technical Report TR2019-101)
      @inproceedings{Seki2019sep,
        author = {Seki, Hiroshi and Hori, Takaaki and Watanabe, Shinji and Le Roux, Jonathan and Hershey, John},
        title = {End-to-End Multilingual Multi-Speaker Speech Recognition},
        booktitle = {Interspeech},
        year = 2019,
        pages = {3755--3759},
        month = sep,
        doi = {10.21437/Interspeech.2019-3038}
      }
  • Research Areas: Artificial Intelligence, Machine Learning, Speech & Audio


The expressive power of end-to-end automatic speech recognition (ASR) systems enables direct estimation of a character or word label sequence from a sequence of acoustic features. Direct optimization of the whole system is advantageous because it not only eliminates the internal linkage necessary for hybrid systems, but also extends the scope of potential applications by training the model for various objectives. In this paper, we tackle the challenging task of multilingual multi-speaker ASR using such an all-in-one end-to-end system. Several multilingual ASR systems were recently proposed based on a monolithic neural network architecture without language-dependent modules, showing that modeling of multiple languages is well within the capabilities of an end-to-end framework. There has also been growing interest in multi-speaker speech recognition, which enables generation of multiple label sequences from single-channel mixed speech. In particular, a multi-speaker end-to-end ASR system that can directly model one-to-many mappings without additional auxiliary clues was recently proposed. The proposed model, which integrates the capabilities of these two systems, is evaluated using mixtures of two speakers generated from 10 languages, including code-switching utterances.
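Multi-speaker end-to-end ASR of the kind described above is commonly trained with a permutation-free (permutation-invariant) objective: since the correspondence between decoder outputs and reference transcripts is unknown, training scores every output-to-reference assignment and backpropagates through the best one. The sketch below illustrates the idea under the assumption of precomputed per-pair losses; the function name and inputs are hypothetical, not the paper's implementation.

```python
from itertools import permutations

def permutation_free_loss(pairwise_loss):
    """Minimum total loss over output-to-reference assignments (a sketch).

    pairwise_loss[i][j] is the loss of decoder output i scored against
    reference transcript j. The training loss is the smallest total
    obtained by any one-to-one assignment of outputs to references.
    """
    n = len(pairwise_loss)
    best = float("inf")
    for perm in permutations(range(n)):
        total = sum(pairwise_loss[i][perm[i]] for i in range(n))
        best = min(best, total)
    return best

# Two-speaker example: output 0 best matches reference 1 and vice versa,
# so the permuted assignment (1.0 + 2.0) beats the identity one (5.0 + 6.0).
losses = [[5.0, 1.0],
          [2.0, 6.0]]
print(permutation_free_loss(losses))  # 3.0
```

For two speakers only two assignments exist, so the exhaustive search over permutations is cheap; its cost grows factorially with the number of speakers.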


  • Related News & Events

    •  NEWS    MERL Speech & Audio Researchers Presenting 7 Papers and a Tutorial at Interspeech 2019
      Date: September 15, 2019 - September 19, 2019
      Where: Graz, Austria
      MERL Contacts: Chiori Hori; Jonathan Le Roux; Gordon Wichern
      Research Areas: Artificial Intelligence, Machine Learning, Speech & Audio
      • MERL Speech & Audio Team researchers will be presenting 7 papers at the 20th Annual Conference of the International Speech Communication Association (INTERSPEECH 2019), which is being held in Graz, Austria from September 15-19, 2019. Topics to be presented include recent advances in end-to-end speech recognition, speech separation, and audio-visual scene-aware dialog. Takaaki Hori is also co-presenting a tutorial on end-to-end speech processing.

        Interspeech is the world's largest and most comprehensive conference on the science and technology of spoken language processing. It gathers around 2000 participants from all over the world.