Phase Reconstruction with Learned Time-Frequency Representations for Single-Channel Speech Separation

Progress in solving the cocktail party problem, i.e., separating the speech of multiple overlapping speakers, has recently accelerated with the invention of techniques such as deep clustering and permutation-free mask inference. These approaches typically focus on estimating target STFT magnitudes and ignore the problem of phase inconsistency. In this paper, we explicitly integrate phase reconstruction into our separation algorithm using a loss function defined on time-domain signals. A deep neural network structure is defined by unfolding a phase reconstruction algorithm and treating each iteration as a layer in our network. Furthermore, instead of using fixed STFT/iSTFT time-frequency representations, we allow our network to learn modified versions of these representations from data. We compare several variants of these unfolded phase reconstruction networks, achieving state-of-the-art results on the publicly available wsj0-2mix dataset, and show improved performance when the STFT/iSTFT-like representations are allowed to adapt.
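To make the unfolding idea concrete, the sketch below shows the classical iterative phase reconstruction loop (Griffin-Lim style) that such a network unrolls: each pass through the loop alternates an iSTFT and an STFT while restoring the target magnitude, and each pass corresponds to one layer of the unfolded network. This is a minimal illustration using SciPy's fixed STFT/iSTFT, not the paper's trainable representations; the function name and parameters are illustrative.

```python
# Illustrative sketch: iterative (Griffin-Lim style) phase reconstruction.
# Unfolding this loop, one iteration per layer, yields the kind of network
# structure described in the abstract (here with fixed, non-learned STFT/iSTFT).
import numpy as np
from scipy.signal import stft, istft

def unfolded_phase_reconstruction(target_mag, n_layers=5, fs=8000, nperseg=256):
    """Recover a time-domain signal from an STFT magnitude by alternating
    iSTFT/STFT projections; each loop iteration is one 'layer'."""
    # Initialize with zero phase (magnitude-only spectrogram).
    spec = target_mag.astype(np.complex128)
    for _ in range(n_layers):
        # Back to the time domain; the result is phase-consistent by construction.
        _, x = istft(spec, fs=fs, nperseg=nperseg)
        # Re-analyze: keep the consistent phase, restore the target magnitude.
        _, _, re_spec = stft(x, fs=fs, nperseg=nperseg)
        F = min(target_mag.shape[0], re_spec.shape[0])
        T = min(target_mag.shape[1], re_spec.shape[1])
        spec = target_mag[:F, :T] * np.exp(1j * np.angle(re_spec[:F, :T]))
    _, x = istft(spec, fs=fs, nperseg=nperseg)
    return x
```

In the learned variant described in the paper, the analysis/synthesis transforms inside this loop are no longer fixed STFT/iSTFT operators but parameterized, trainable layers, and the whole unrolled stack is optimized end-to-end with a time-domain loss.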