TR2018-104

A Purely End-to-end System for Multi-speaker Speech Recognition


    •  Seki, H., Hori, T., Watanabe, S., Le Roux, J., Hershey, J., "A Purely End-to-end System for Multi-speaker Speech Recognition", Annual Meeting of the Association for Computational Linguistics (ACL), July 2018, pp. 2620-2630.
      @inproceedings{Seki2018jul,
        author = {Seki, Hiroshi and Hori, Takaaki and Watanabe, Shinji and Le Roux, Jonathan and Hershey, John},
        title = {A Purely End-to-end System for Multi-speaker Speech Recognition},
        booktitle = {Annual Meeting of the Association for Computational Linguistics (ACL)},
        year = 2018,
        pages = {2620--2630},
        month = jul,
        publisher = {Elsevier},
        url = {https://www.merl.com/publications/TR2018-104}
      }
  • Research Areas:

    Artificial Intelligence, Speech & Audio

Abstract:

Recently, there has been growing interest in multi-speaker speech recognition, where the utterances of multiple speakers are recognized from their mixture. Promising techniques have been proposed for this task, but earlier works have required additional training data such as isolated source signals or senone alignments for effective learning. In this paper, we propose a new sequence-to-sequence framework to directly decode multiple label sequences from a single speech sequence by unifying source separation and speech recognition functions in an end-to-end manner. We further propose a new objective function to improve the contrast between the hidden vectors to avoid generating similar hypotheses. Experimental results show that the model is directly able to learn a mapping from a speech mixture to multiple label sequences, achieving 83.1% relative improvement compared to a model trained without the proposed objective. Interestingly, the results are comparable to those produced by previous end-to-end works featuring explicit separation and recognition modules.

 

  • Related Publication

  •  Seki, H., Hori, T., Watanabe, S., Le Roux, J., Hershey, J., "A Purely End-to-end System for Multi-speaker Speech Recognition", arXiv, July 2018.
    @article{Seki2018jul2,
      author = {Seki, Hiroshi and Hori, Takaaki and Watanabe, Shinji and Le Roux, Jonathan and Hershey, John},
      title = {A Purely End-to-end System for Multi-speaker Speech Recognition},
      journal = {arXiv},
      year = 2018,
      month = jul,
      url = {https://arxiv.org/abs/1805.05826}
    }