Single-Channel Multi-Speaker Separation using Deep Clustering

Training deep discriminative embeddings to solve the cocktail party problem.

MERL Researchers: John R. Hershey, Jonathan Le Roux, Shinji Watanabe, Bret Harsham (Speech & Audio).
Joint work with: Yusuf Isik (Sabanci University, TUBITAK BILGEM) and Zhuo Chen (Columbia University).



The human auditory system gives us the extraordinary ability to converse in the midst of a noisy throng of partygoers. Solving this so-called cocktail party problem has proven extremely challenging for computers, and separating and recognizing speech in such conditions has been the holy grail of speech processing for more than 50 years. Deep clustering is a recently introduced deep learning architecture that uses discriminatively trained embeddings as the basis for clustering, producing unprecedented speaker-independent single-channel separation performance on two-speaker and three-speaker mixtures.

In our recently proposed deep clustering framework [Hershey et al., ICASSP 2016], a neural network is trained to assign an embedding vector to each element of a multi-dimensional signal, such that clustering the embeddings yields a desired segmentation of the signal. In the cocktail-party problem, the embeddings are assigned to each time-frequency (TF) index of the short-time Fourier transform (STFT) of the mixture of speech signals. Clustering these embeddings yields an assignment of each TF bin to one of the inferred sources. These assignments are used as a masking function to extract the dominant parts of each source.
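The inference pipeline described above can be sketched as follows. This is a minimal illustration, not MERL's released code: the k-means routine is a plain NumPy implementation, and the array shapes (time frames T, frequency bins F, embedding dimension D) are assumptions for the sake of the example.

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain k-means on embedding rows X of shape (n_points, d)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # assign each embedding to its nearest center
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(0)
    return labels

def separate(embeddings, mixture_stft, n_speakers=2):
    """Cluster per-TF-bin embeddings, then use the cluster assignments
    as binary masks to extract each inferred source from the mixture.
    embeddings: (T, F, D) real; mixture_stft: (T, F) complex."""
    T, F, D = embeddings.shape
    labels = kmeans(embeddings.reshape(T * F, D), n_speakers).reshape(T, F)
    # one binary mask per inferred source; masking keeps the TF bins
    # where that source dominates
    return [np.where(labels == j, mixture_stft, 0) for j in range(n_speakers)]
```

Inverting the STFT of each masked output would then yield the separated waveforms; the masks partition the mixture, so the extracted sources sum back to it.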

Preliminary work on this method produced remarkable performance, improving SNR by 6 dB on the task of separating two unknown speakers from a single-channel mixture [Hershey et al., ICASSP 2016]. In [Isik et al., Interspeech 2016], we presented improvements and extensions that enabled a leap forward in separation quality, reaching levels of improvement -- over 10 dB SNR gain -- that were previously unobtainable even in simpler speech enhancement tasks. In particular, we showed how the model could be trained in an end-to-end fashion, optimizing the whole architecture for best signal quality. We also evaluated our method using automatic speech recognition (ASR), and showed that it can reduce the word error rate (WER) from 89.1% down to 30.8%. This represents a major advancement towards solving the cocktail party problem.
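The training objective behind these results, from [Hershey et al., ICASSP 2016], pushes embeddings of TF bins dominated by the same speaker together and embeddings of different speakers' bins apart, by minimizing |VV' - YY'|_F^2, where V holds the unit-norm embeddings and Y the ideal one-hot TF-to-speaker assignments. A sketch of the loss in NumPy (variable names are ours, not from the released code), using the standard expansion that avoids forming the N x N affinity matrices:

```python
import numpy as np

def deep_clustering_loss(V, Y):
    """Deep clustering objective |VV' - YY'|_F^2 for N TF bins.
    V: (N, D) embeddings; Y: (N, C) one-hot speaker assignments.
    Expanding the Frobenius norm gives three small Gram-matrix terms,
    so the N x N affinities are never materialized."""
    return (np.sum((V.T @ V) ** 2)
            - 2.0 * np.sum((V.T @ Y) ** 2)
            + np.sum((Y.T @ Y) ** 2))
```

The loss is zero exactly when the embedding affinities match the ideal speaker affinities, and it is permutation-free: it never has to decide which output corresponds to which speaker, which is what makes speaker-independent training tractable.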

In May 2017, our research was featured in a Mitsubishi Electric Corporation Press Release, and we presented a live demonstration of our speech separation system at the company's annual R&D event in Tokyo, Japan. The demonstration was widely covered by the media, with reports by three of the main Japanese TV stations and multiple articles in print and online newspapers.


Videos

Single Channel Multi Speaker Separation Using Deep Clustering

PowerPoint Demo

Deep_Clustering_Demo.pptx

Audio Samples

Female-female mixture, Original [ wav mp3 ogg ]
Female-female mixture, speaker 1, CASA [ wav mp3 ogg ]
Female-female mixture, speaker 2, CASA [ wav mp3 ogg ]
Female-female mixture, speaker 1, deep clustering [ wav mp3 ogg ]
Female-female mixture, speaker 2, deep clustering [ wav mp3 ogg ]

Male-male mixture, Original [ wav mp3 ogg ]
Male-male mixture, speaker 1, CASA [ wav mp3 ogg ]
Male-male mixture, speaker 2, CASA [ wav mp3 ogg ]
Male-male mixture, speaker 1, deep clustering [ wav mp3 ogg ]
Male-male mixture, speaker 2, deep clustering [ wav mp3 ogg ]

Script to generate the multi-speaker dataset using WSJ0

create-speaker-mixtures.zip
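The released script builds the dataset from WSJ0 utterance pairs; the core operation is summing two utterances at a chosen relative level. A minimal sketch of that mixing step (our own illustration, not the contents of the zip; it assumes equal-length signals already loaded as NumPy arrays):

```python
import numpy as np

def mix_at_snr(s1, s2, snr_db, eps=1e-12):
    """Mix two equal-length signals with s1 at snr_db dB above s2.
    Returns the rescaled sources and their peak-normalized sum."""
    # equalize power, then attenuate s2 to set the relative level
    s2 = s2 * np.sqrt(np.sum(s1 ** 2) / (np.sum(s2 ** 2) + eps))
    s2 = s2 * 10.0 ** (-snr_db / 20.0)
    mix = s1 + s2
    peak = np.max(np.abs(mix)) + eps  # avoid clipping when writing to wav
    return s1 / peak, s2 / peak, mix / peak
```

Scaling all three outputs by the same peak preserves the chosen SNR while keeping the mixture in range for writing to a wav file.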

References

[Hershey et al., ICASSP 2016] J. R. Hershey, Z. Chen, J. Le Roux, and S. Watanabe, "Deep Clustering: Discriminative Embeddings for Segmentation and Separation," Proc. ICASSP, 2016.
[Isik et al., Interspeech 2016] Y. Isik, J. Le Roux, Z. Chen, S. Watanabe, and J. R. Hershey, "Single-Channel Multi-Speaker Separation Using Deep Clustering," Proc. Interspeech, 2016.