TR2014-104

Discriminatively Trained Recurrent Neural Networks for Single-Channel Speech Separation


    •  Weninger, F.; Le Roux, J.; Hershey, J.R.; Schuller, B., "Discriminatively Trained Recurrent Neural Networks for Single-Channel Speech Separation", IEEE Global Conference on Signal and Information Processing (GlobalSIP), DOI: 10.1109/GlobalSIP.2014.7032183, December 2014, pp. 577-581.
      @inproceedings{Weninger2014dec,
        author = {Weninger, F. and {Le Roux}, J. and Hershey, J.R. and Schuller, B.},
        title = {Discriminatively Trained Recurrent Neural Networks for Single-Channel Speech Separation},
        booktitle = {IEEE Global Conference on Signal and Information Processing (GlobalSIP)},
        year = 2014,
        pages = {577--581},
        month = dec,
        publisher = {IEEE},
        doi = {10.1109/GlobalSIP.2014.7032183},
        url = {http://www.merl.com/publications/TR2014-104}
      }
Research Areas: Multimedia, Speech & Audio


This paper describes an in-depth investigation of training criteria, network architectures and feature representations for regression-based single-channel speech separation with deep neural networks (DNNs). We use a generic discriminative training criterion corresponding to optimal source reconstruction from time-frequency masks, and introduce its application to speech separation in a reduced feature space (Mel domain). A comparative evaluation of time-frequency mask estimation by DNNs, recurrent DNNs and non-negative matrix factorization on the 2nd CHiME Speech Separation and Recognition Challenge shows consistent improvements from discriminative training, with long short-term memory recurrent DNNs obtaining the overall best results. Furthermore, our results confirm the importance of fine-tuning the feature representation for DNN training.
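The discriminative criterion described above scores the network on the quality of the reconstructed source (the mask applied to the mixture) rather than on the mask itself. A minimal sketch of such a signal-approximation loss, using NumPy and hypothetical spectrogram shapes (the function name, shapes, and toy data are illustrative, not taken from the paper):

```python
import numpy as np

def signal_approximation_loss(mask, mixture_mag, target_mag):
    """Discriminative 'signal approximation' objective: penalize the
    reconstruction error of the masked mixture against the clean source,
    rather than the error of the mask against some reference mask."""
    estimate = mask * mixture_mag            # element-wise time-frequency masking
    return np.mean((estimate - target_mag) ** 2)

# Toy example (frames x frequency bins); values are synthetic.
rng = np.random.default_rng(0)
mixture = rng.random((100, 257)) + 1.0
target = 0.6 * mixture                       # pretend speech carries 60% of each bin
ideal_mask = target / mixture                # the ratio mask that reconstructs the target
print(signal_approximation_loss(ideal_mask, mixture, target))  # → 0.0
```

Under this objective, a mask is optimal exactly when the masked mixture matches the target source, which is the property the paper exploits for training mask-estimating DNNs and LSTMs.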