TR2016-073

Single-Channel Multi-Speaker Separation using Deep Clustering


    •  Isik, Y., Le Roux, J., Chen, Z., Watanabe, S., Hershey, J.R., "Single-Channel Multi-Speaker Separation using Deep Clustering", Interspeech, DOI: 10.21437/Interspeech.2016-1176, September 2016, pp. 545-549.
      @inproceedings{Isik2016sep,
        author = {Isik, Yusuf and Le Roux, Jonathan and Chen, Zhuo and Watanabe, Shinji and Hershey, John R.},
        title = {Single-Channel Multi-Speaker Separation using Deep Clustering},
        booktitle = {Interspeech},
        year = 2016,
        pages = {545--549},
        month = sep,
        doi = {10.21437/Interspeech.2016-1176},
        url = {https://www.merl.com/publications/TR2016-073}
      }
  • Research Areas: Artificial Intelligence, Speech & Audio

Deep clustering is a recently introduced deep learning architecture that uses discriminatively trained embeddings as the basis for clustering. It was recently applied to spectrogram segmentation, yielding impressive performance on speaker-independent multi-speaker separation. In this paper we extend the baseline system with an end-to-end signal approximation objective that greatly improves performance on a challenging speech separation task. We first significantly improve upon the baseline system performance by incorporating better regularization, larger temporal context, and a deeper architecture, culminating in an overall improvement in signal-to-distortion ratio (SDR) of 10.3 dB compared to the baseline of 6.0 dB for two-speaker separation, as well as a 7.1 dB SDR improvement for three-speaker separation. We then extend the model to incorporate an enhancement layer to refine the signal estimates, and perform end-to-end training through both the clustering and enhancement stages to maximize signal fidelity. We evaluate the results using automatic speech recognition. The new signal approximation objective, combined with end-to-end training, produces unprecedented performance, reducing the word error rate (WER) from 89.1% down to 30.8%. This represents a major advancement towards solving the cocktail party problem.
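As a rough sketch of the training criterion underlying deep clustering: the network maps each time-frequency bin of the mixture spectrogram to an embedding vector, and training pushes the embedding affinity matrix toward the ideal speaker-assignment affinity matrix, so that bins dominated by the same speaker cluster together. The objective is the Frobenius norm ||VV^T - YY^T||_F^2, expanded into a low-rank form to avoid the large N x N affinities. The function name and array shapes below are illustrative, not taken from the paper's code:

```python
import numpy as np

def deep_clustering_loss(V, Y):
    """Deep clustering objective ||V V^T - Y Y^T||_F^2.

    V : (N, D) array of embeddings, one row per time-frequency bin.
    Y : (N, C) one-hot array assigning each bin to its dominant speaker.

    The expansion ||V^T V||_F^2 - 2 ||V^T Y||_F^2 + ||Y^T Y||_F^2 gives the
    same value without ever forming the N x N affinity matrices.
    """
    a = np.sum((V.T @ V) ** 2)   # ||V^T V||_F^2
    b = np.sum((V.T @ Y) ** 2)   # ||V^T Y||_F^2
    c = np.sum((Y.T @ Y) ** 2)   # ||Y^T Y||_F^2
    return a - 2.0 * b + c

# Toy check: embeddings that exactly match the one-hot speaker labels
# induce identical affinity matrices, so the loss is zero.
Y = np.array([[1.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0]])
print(deep_clustering_loss(Y, Y))  # 0.0
```

At test time, the paper's approach clusters the learned embeddings (e.g. with k-means, one cluster per speaker) and uses the resulting partition as a binary mask on the mixture spectrogram; the end-to-end extension then refines these masked estimates through the enhancement stage.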

 
