TR2016-073

Single-Channel Multi-Speaker Separation using Deep Clustering

- Isik, Y., Le Roux, J., Chen, Z., Watanabe, S., Hershey, J.R., "Single-Channel Multi-Speaker Separation using Deep Clustering", Interspeech, DOI: 10.21437/Interspeech.2016-1176, September 2016, pp. 545-549.
  BibTeX:
  @inproceedings{Isik2016sep,
    author = {Isik, Yusuf and Le Roux, Jonathan and Chen, Zhuo and Watanabe, Shinji and Hershey, John R.},
    title = {Single-Channel Multi-Speaker Separation using Deep Clustering},
    booktitle = {Interspeech},
    year = 2016,
    pages = {545--549},
    month = sep,
    doi = {10.21437/Interspeech.2016-1176},
    url = {https://www.merl.com/publications/TR2016-073}
  }
MERL Contact: Jonathan Le Roux
Research Areas: Artificial Intelligence, Speech & Audio

Abstract:

Deep clustering is a recently introduced deep learning architecture that uses discriminatively trained embeddings as the basis for clustering. It was recently applied to spectrogram segmentation, resulting in impressive results on speaker-independent multi-speaker separation. In this paper we extend the baseline system with an end-to-end signal approximation objective that greatly improves performance on a challenging speech separation task. We first significantly improve upon the baseline system performance by incorporating better regularization, larger temporal context, and a deeper architecture, culminating in an overall improvement in signal-to-distortion ratio (SDR) of 10.3 dB compared to the baseline of 6.0 dB for two-speaker separation, as well as a 7.1 dB SDR improvement for three-speaker separation. We then extend the model to incorporate an enhancement layer to refine the signal estimates, and perform end-to-end training through both the clustering and enhancement stages to maximize signal fidelity. We evaluate the results using automatic speech recognition. The new signal approximation objective, combined with end-to-end training, produces unprecedented performance, reducing the word error rate (WER) from 89.1% down to 30.8%. This represents a major advancement towards solving the cocktail party problem.
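
As a rough illustration of what "discriminatively trained embeddings as the basis for clustering" can mean in practice, the sketch below computes the affinity-matching objective usually associated with deep clustering, ||V V^T - Y Y^T||_F^2, where V holds one embedding per time-frequency bin and Y holds one-hot indicators of the dominant speaker in each bin. The function name, array shapes, and toy data are assumptions made for illustration; none of them come from this page, and the sketch does not reproduce the network, regularization, or enhancement layer described in the abstract.

```python
import numpy as np


def deep_clustering_loss(V, Y):
    """Affinity loss ||V V^T - Y Y^T||_F^2 without forming N x N matrices.

    V: (N, D) array of embeddings, one row per time-frequency bin (assumed layout).
    Y: (N, C) array of one-hot dominant-speaker indicators (assumed layout).
    """
    VtV = V.T @ V  # (D, D)
    VtY = V.T @ Y  # (D, C)
    YtY = Y.T @ Y  # (C, C)
    # Expanding the Frobenius norm gives three small-matrix terms.
    return float(np.sum(VtV ** 2) - 2.0 * np.sum(VtY ** 2) + np.sum(YtY ** 2))


# Toy data: 6 time-frequency bins, 2 speakers (hypothetical labels).
labels = np.array([0, 0, 1, 1, 0, 1])
Y = np.eye(2)[labels]

# Embeddings that reproduce the ideal affinities give zero loss.
ideal_loss = deep_clustering_loss(Y, Y)

# Random embeddings give a positive loss in general.
rng = np.random.default_rng(0)
random_loss = deep_clustering_loss(rng.standard_normal((6, 3)), Y)
```

The expanded form only needs D x D, D x C, and C x C products, which matters because the number of time-frequency bins N is typically far larger than the embedding dimension D. At separation time, embeddings of this kind would be clustered (for example with k-means) to assign each bin to a speaker.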

Related News & Events

NEWS MERL's speech research featured in NPR's All Things Considered
Date: February 5, 2018
Where: National Public Radio (NPR)
MERL Contact: Jonathan Le Roux
Research Area: Speech & Audio
Brief
- MERL's speech separation technology was featured in NPR's All Things Considered, as part of an episode of All Tech Considered on artificial intelligence, "Can Computers Learn Like Humans?". An example separating the overlapped speech of two of the show's hosts was played on the air.
  The technology is based on a proprietary deep learning method called Deep Clustering. It is the world's first technology that separates in real time the simultaneous speech of multiple unknown speakers recorded with a single microphone. It is a key step towards building machines that can interact in noisy environments, in the same way that humans can have meaningful conversations in the presence of many other conversations.
  A live demonstration was featured in Mitsubishi Electric Corporation's Annual R&D Open House last year, and was also covered in international media at the time.
  
  (Photo credit: Sam Rowe for NPR)
  
  Links:
  "Can Computers Learn Like Humans?" (NPR, All Things Considered)
  MERL Deep Clustering Demo
NEWS MERL's breakthrough speech separation technology featured in Mitsubishi Electric Corporation's Annual R&D Open House
Date: May 24, 2017
Where: Tokyo, Japan
MERL Contact: Jonathan Le Roux
Research Area: Speech & Audio
Brief
- Mitsubishi Electric Corporation announced that it has created the world's first technology that separates in real time the simultaneous speech of multiple unknown speakers recorded with a single microphone. It is a key step towards building machines that can interact in noisy environments, in the same way that humans can have meaningful conversations in the presence of many other conversations. In tests, the simultaneous speech of two and three people was separated with up to 90 and 80 percent accuracy, respectively. The novel technology, which was realized with Mitsubishi Electric's proprietary "Deep Clustering" method based on artificial intelligence (AI), is expected to contribute to more intelligible voice communications and more accurate automatic speech recognition. A characteristic feature of this approach is its versatility, in the sense that voices can be separated regardless of their language or the gender of the speakers.
  A live speech separation demonstration that took place on May 24 in Tokyo, Japan, was widely covered by the Japanese media, with reports by three of the main Japanese TV stations and multiple articles in print and online newspapers. The technology is based on recent research by MERL's Speech and Audio team.
  
  Links:
  Mitsubishi Electric Corporation Press Release
  MERL Deep Clustering Demo
  
  Media Coverage:
  
  Fuji TV, News, "Minna no Mirai" (Japanese)
  The Nikkei (Japanese)
  Nikkei Technology Online (Japanese)
  Sankei Biz (Japanese)
  EE Times Japan (Japanese)
  ITpro (Japanese)
  Nikkan Sports (Japanese)
  Nikkan Kogyo Shimbun (Japanese)
  Dempa Shimbun (Japanese)
  Il Sole 24 Ore (Italian)
  IEEE Spectrum (English)

Related Research Highlights

Deep Clustering
