Bootstrapping Single-Channel Source Separation via Unsupervised Spatial Clustering on Stereo Mixtures

Separating an audio scene into isolated sources is a fundamental problem in computer audition, analogous to image segmentation in visual scene analysis. Source separation systems based on deep learning are currently the most successful approaches to the underdetermined separation problem, where there are more sources than channels. Such systems are normally trained on sound mixtures for which the ground truth decomposition is already known. In this work, we apply an unsupervised spatial source separation algorithm to stereo mixtures to generate initial decompositions, which we then use to train a deep learning source separation model. These estimated decompositions vary greatly in quality across the training mixtures. To account for this, we weight the data during training using a confidence measure that assesses which mixtures, or parts of mixtures, are well separated by the unsupervised algorithm. Once trained, the model can be applied to separate single-channel mixtures, where no source direction information is available. The idea is to use simple, low-level processing to separate sources in an unsupervised fashion, identify easy conditions, and then use that knowledge to bootstrap a (self-)supervised source separation model for difficult conditions. We also explore using the two approaches in an ensemble.
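
The confidence weighting described above can be illustrated with a minimal sketch. It is not the training objective used in this work: the function name `confidence_weighted_loss`, the squared-error criterion, the normalization by total confidence, and the toy 2x2 grids are all assumptions chosen for illustration. The sketch shows only the general idea that time-frequency bins where the unsupervised decomposition is trusted contribute more to the loss than bins where it is not.

```python
def confidence_weighted_loss(estimate, pseudo_target, confidence, eps=1e-8):
    """Confidence-weighted mean squared error over a time-frequency grid.

    All three arguments are nested lists of equal shape. `pseudo_target` stands
    in for the output of the unsupervised separation, and `confidence` holds a
    hypothetical per-bin weight in [0, 1]. Both are illustrative assumptions.
    """
    weighted_error = 0.0
    total_confidence = 0.0
    for est_row, tgt_row, conf_row in zip(estimate, pseudo_target, confidence):
        for est, tgt, conf in zip(est_row, tgt_row, conf_row):
            weighted_error += conf * (est - tgt) ** 2
            total_confidence += conf
    # Normalize so the loss scale does not depend on how many bins are trusted.
    return weighted_error / (total_confidence + eps)


# Toy example: the model output disagrees with the pseudo-target in two bins.
estimate = [[1.0, 0.0], [0.0, 1.0]]
pseudo_target = [[1.0, 1.0], [0.0, 0.0]]

# Trusting every bin equally penalizes both disagreements.
uniform = [[1.0, 1.0], [1.0, 1.0]]
# Trusting only the first column ignores the bins where they disagree.
selective = [[1.0, 0.0], [1.0, 0.0]]

loss_uniform = confidence_weighted_loss(estimate, pseudo_target, uniform)
loss_selective = confidence_weighted_loss(estimate, pseudo_target, selective)
```

In this toy case the selective weighting drives the loss to zero, because the only disagreements fall in bins marked as unreliable. In practice the weights would come from a confidence measure computed on the unsupervised separation rather than being set by hand.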