Single-Channel Multi-Speaker Separation using Deep Clustering

Training deep discriminative embeddings to solve the cocktail party problem.

MERL Researchers: John R. Hershey, Jonathan Le Roux, Shinji Watanabe, Bret Harsham (Speech & Audio).
Joint work with: Yusuf Isik (Sabanci University, TUBITAK BILGEM) and Zhuo Chen (Columbia University).


The human auditory system gives us the extraordinary ability to converse in the midst of a noisy throng of partygoers. Solving this so-called cocktail party problem has proven extremely challenging for computers, and separating and recognizing speech in such conditions has been the holy grail of speech processing for more than 50 years. Deep clustering is a recently introduced deep learning architecture that uses discriminatively trained embeddings as the basis for clustering, producing unprecedented speaker-independent single-channel separation performance on two-speaker and three-speaker mixtures.

In our recently proposed deep clustering framework [Hershey et al., ICASSP 2016], a neural network is trained to assign an embedding vector to each element of a multi-dimensional signal, such that clustering the embeddings yields a desired segmentation of the signal. In the cocktail party problem, an embedding is assigned to each time-frequency (TF) bin of the short-time Fourier transform (STFT) of the mixture of speech signals. Clustering these embeddings yields an assignment of each TF bin to one of the inferred sources. These assignments are then used as masks on the mixture STFT to extract the dominant parts of each source.
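The clustering and masking steps described above can be illustrated with a minimal Python sketch. It is not the released system: the embeddings are synthetic stand-ins for the output of a trained network, k-means is used as one simple clustering choice, and the names kmeans_labels, binary_masks_from_embeddings, and apply_masks are hypothetical helpers introduced only for this example.

```python
import numpy as np


def kmeans_labels(points, n_clusters, n_iter=20):
    """Plain k-means with a deterministic farthest-point initialisation."""
    centroids = [points[0]]
    for _ in range(1, n_clusters):
        # Squared distance from every point to its nearest chosen centroid.
        nearest = np.min(
            [np.sum((points - c) ** 2, axis=1) for c in centroids], axis=0
        )
        centroids.append(points[np.argmax(nearest)])
    centroids = np.stack(centroids).astype(float)

    def assign(centers):
        dists = np.sum((points[:, None, :] - centers[None, :, :]) ** 2, axis=2)
        return np.argmin(dists, axis=1)

    for _ in range(n_iter):
        labels = assign(centroids)
        for k in range(n_clusters):
            members = points[labels == k]
            if len(members) > 0:
                centroids[k] = members.mean(axis=0)
    return assign(centroids)


def binary_masks_from_embeddings(embeddings, n_sources):
    """Cluster per-bin embeddings (frames, bins, dim) into one binary mask per source."""
    n_frames, n_bins, _ = embeddings.shape
    flat = embeddings.reshape(n_frames * n_bins, -1)
    labels = kmeans_labels(flat, n_sources).reshape(n_frames, n_bins)
    return np.stack([labels == k for k in range(n_sources)]).astype(float)


def apply_masks(mixture_stft, masks):
    """Multiply the mixture STFT by each mask to get one STFT estimate per source."""
    return masks * mixture_stft[None, :, :]


# Toy data (assumption): 4 frames x 6 frequency bins, 2-dimensional embeddings
# that sit near one of two directions depending on the dominant source.
rng = np.random.default_rng(0)
n_frames, n_bins = 4, 6
true_labels = (np.arange(n_frames * n_bins).reshape(n_frames, n_bins) % 3 == 0).astype(int)
embeddings = np.eye(2)[true_labels] + 0.05 * rng.standard_normal((n_frames, n_bins, 2))
mixture_stft = rng.standard_normal((n_frames, n_bins)) + 1j * rng.standard_normal((n_frames, n_bins))

masks = binary_masks_from_embeddings(embeddings, n_sources=2)
separated = apply_masks(mixture_stft, masks)
print(masks.shape, separated.shape)
```

Because each TF bin is assigned to exactly one cluster, the masks partition the spectrogram and the masked estimates sum back to the mixture. The cluster order is arbitrary, so the estimated sources are recovered only up to a permutation.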

Preliminary work on this method produced remarkable performance, improving SNR by 6 dB on the task of separating two unknown speakers from a single-channel mixture [Hershey et al., ICASSP 2016]. In [Isik et al., Interspeech 2016], we presented improvements and extensions that enabled a leap forward in separation quality, reaching levels of improvement -- over 10 dB SNR gain -- that were previously unobtainable even in simpler speech enhancement tasks. In particular, we showed how the model could be trained in an end-to-end fashion, optimizing the whole architecture for best signal quality. We also evaluated our method using automatic speech recognition (ASR), and showed that it can reduce the word error rate (WER) from 89.1% down to 30.8%. This represents a major advancement towards solving the cocktail party problem.
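To put these figures in perspective, the short Python sketch below works through the arithmetic. It assumes a plain time-domain SNR definition (target power over error power), which is not necessarily the exact metric computed in the papers; the signals are synthetic, and snr_db and relative_reduction are hypothetical helper names.

```python
import numpy as np


def snr_db(reference, estimate):
    """Signal-to-noise ratio in dB, treating (reference - estimate) as the noise."""
    noise = reference - estimate
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(noise ** 2))


def relative_reduction(before, after):
    """Fraction by which a quantity drops when going from `before` to `after`."""
    return (before - after) / before


# Synthetic one-second signals at 16 kHz (assumption, for illustration only).
rng = np.random.default_rng(0)
target = rng.standard_normal(16000)
interference = rng.standard_normal(16000)

mixture = target + interference
# An estimate whose residual interference has 10x less power than in the mixture.
estimate = target + interference / np.sqrt(10.0)

snr_gain = snr_db(target, estimate) - snr_db(target, mixture)
wer_drop = relative_reduction(89.1, 30.8)

print(f"SNR gain: {snr_gain:.1f} dB")
print(f"Relative WER reduction: {100 * wer_drop:.1f}%")
```

Under this definition, a 10 dB gain corresponds to a tenfold reduction in the power of the residual interference, and the drop in WER from 89.1% to 30.8% is a relative reduction of about 65%.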

In May 2017, our research was featured in a Mitsubishi Electric Corporation Press Release, and we presented a live demonstration of our speech separation system at the company's annual R&D event in Tokyo, Japan. The demonstration was widely covered by the media, with reports by three of the main Japanese TV stations and multiple articles in print and online newspapers.

Media Coverage

TBS, News, N Studio (Japanese)
Fuji TV, News, "Minna no Mirai" (Japanese)
IEEE Spectrum (English)
The Nikkei (Japanese)
Nikkei Technology Online (Japanese)
Sankei Biz (Japanese)
EE Times Japan (Japanese)
ITpro (Japanese)
Nikkan Sports (Japanese)
Nikkan Kogyo Shimbun (Japanese)
Dempa Shimbun (Japanese)
Il Sole 24 Ore (Italian)

Video

Single Channel Multi Speaker Separation Using Deep Clustering

PowerPoint Demo

Deep_Clustering_Demo.pptx

Audio Samples

Female-female mixture, Original
[ wav mp3 ogg ]

Female-female mixture, speaker 1, CASA
[ wav mp3 ogg ]

Female-female mixture, speaker 2, CASA
[ wav mp3 ogg ]

Female-female mixture, speaker 1, deep clustering
[ wav mp3 ogg ]

Female-female mixture, speaker 2, deep clustering
[ wav mp3 ogg ]

Male-male mixture, Original
[ wav mp3 ogg ]

Male-male mixture, speaker 1, CASA
[ wav mp3 ogg ]

Male-male mixture, speaker 2, CASA
[ wav mp3 ogg ]

Male-male mixture, speaker 1, deep clustering
[ wav mp3 ogg ]

Male-male mixture, speaker 2, deep clustering
[ wav mp3 ogg ]

Software & Data Downloads

Scripts to generate the wsj0-mix multi-speaker dataset

create-speaker-mixtures.zip

spatialize_wsj0-mix.zip

MERL Publications

Hershey, J.R., Chen, Z., Le Roux, J., Watanabe, S., "Deep Clustering: Discriminative Embeddings for Segmentation and Separation", IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), DOI: 10.1109/ICASSP.2016.7471631, March 2016, pp. 31-35.
BibTeX TR2016-003 PDF
@inproceedings{Hershey2016mar,
  author = {Hershey, John R. and Chen, Zhuo and {Le Roux}, Jonathan and Watanabe, Shinji},
  title = {{Deep Clustering: Discriminative Embeddings for Segmentation and Separation}},
  booktitle = {IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)},
  year = 2016,
  pages = {31--35},
  month = mar,
  doi = {10.1109/ICASSP.2016.7471631},
  url = {https://www.merl.com/publications/TR2016-003}
}
Isik, Y., Le Roux, J., Chen, Z., Watanabe, S., Hershey, J.R., "Single-Channel Multi-Speaker Separation using Deep Clustering", Interspeech, DOI: 10.21437/Interspeech.2016-1176, September 2016, pp. 545-549.
BibTeX TR2016-073 PDF
@inproceedings{Isik2016sep,
  author = {Isik, Yusuf and {Le Roux}, Jonathan and Chen, Zhuo and Watanabe, Shinji and Hershey, John R.},
  title = {{Single-Channel Multi-Speaker Separation using Deep Clustering}},
  booktitle = {Interspeech},
  year = 2016,
  pages = {545--549},
  month = sep,
  doi = {10.21437/Interspeech.2016-1176},
  url = {https://www.merl.com/publications/TR2016-073}
}