Alternative Objective Functions for Deep Clustering

The recently proposed deep clustering framework represents a significant step towards solving the cocktail party problem. This study proposes and compares a variety of alternative objective functions for training deep clustering networks. In addition, whereas the original deep clustering work relied on k-means clustering for test-time inference, here we investigate inference methods that are matched to the training objective. Furthermore, we explore the use of an improved chimera network architecture for speech separation, which combines deep clustering with mask-inference networks in a multi-objective training scheme. The deep clustering loss acts as a regularizer while the end-to-end mask-inference network is trained for best separation. With further iterative phase reconstruction, our best proposed method achieves a state-of-the-art signal-to-distortion ratio (SDR) of 11.5 dB on the publicly available wsj0-2mix dataset, with a much simpler architecture than the previous best approach.
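As a point of reference for the multi-objective scheme described above, the following is a minimal sketch of the classic deep clustering affinity objective, ||V V^T - Y Y^T||_F^2, together with a weighted combination with a mask-inference loss. It is an illustration under stated assumptions, not the implementation used in this study: the function names, the weight `alpha`, the convex-combination form, and the toy shapes are all hypothetical, and the alternative objectives compared in this work are not shown.

```python
import numpy as np

def deep_clustering_loss(V, Y):
    """Classic affinity loss ||V V^T - Y Y^T||_F^2.

    V: (N, D) embeddings, one row per time-frequency bin.
    Y: (N, C) one-hot source assignments for the same bins.
    The low-rank expansion avoids forming the N x N affinity matrices.
    """
    VtV = V.T @ V  # (D, D)
    VtY = V.T @ Y  # (D, C)
    YtY = Y.T @ Y  # (C, C)
    return float(np.sum(VtV ** 2) - 2.0 * np.sum(VtY ** 2) + np.sum(YtY ** 2))

def multi_objective_loss(dc_loss, mi_loss, alpha=0.5):
    """Hypothetical weighting: alpha on deep clustering, (1 - alpha) on mask inference."""
    return alpha * dc_loss + (1.0 - alpha) * mi_loss

# Toy example: 6 time-frequency bins, 4-dimensional embeddings, 2 sources.
rng = np.random.default_rng(0)
V = rng.standard_normal((6, 4))
V /= np.linalg.norm(V, axis=1, keepdims=True)  # unit-norm embeddings (assumed)
Y = np.eye(2)[np.array([0, 0, 1, 1, 0, 1])]    # one-hot labels per bin
dc = deep_clustering_loss(V, Y)
```

With `alpha` close to zero, the deep clustering term contributes little and acts mainly as a regularizer on the mask-inference objective, which is the role described above.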