TR2015-094

Speech Enhancement with LSTM Recurrent Neural Networks and its Application to Noise-Robust ASR


    •  Weninger, F.J.; Erdogan, H.; Watanabe, S.; Vincent, E.; Le Roux, J.; Hershey, J.R.; Schuller, B.W., "Speech Enhancement with LSTM Recurrent Neural Networks and Its Application to Noise-Robust ASR", Latent Variable Analysis and Signal Separation Conference (LVA), DOI: 10.1007/978-3-319-22482-4_11, ISBN: 978-3-319-22482-4, August 2015, vol. 9237, pp. 91-99.
    @inproceedings{Weninger2015aug,
      author = {Weninger, F.J. and Erdogan, H. and Watanabe, S. and Vincent, E. and {Le Roux}, J. and Hershey, J.R. and Schuller, B.W.},
      title = {Speech Enhancement with {LSTM} Recurrent Neural Networks and Its Application to Noise-Robust {ASR}},
      booktitle = {Latent Variable Analysis and Signal Separation Conference (LVA)},
      year = 2015,
      volume = 9237,
      pages = {91--99},
      month = aug,
      doi = {10.1007/978-3-319-22482-4_11},
      isbn = {978-3-319-22482-4},
      url = {http://www.merl.com/publications/TR2015-094}
    }
    Research Areas: Multimedia, Speech & Audio

We evaluate recent developments in recurrent neural network (RNN) based speech enhancement in light of noise-robust automatic speech recognition (ASR). The proposed framework is based on Long Short-Term Memory (LSTM) RNNs which are discriminatively trained according to an optimal speech reconstruction objective. We demonstrate that LSTM speech enhancement, even when used 'naively' as front-end processing, delivers competitive results on the CHiME-2 speech recognition task. Furthermore, simple, feature-level fusion based extensions to the framework are proposed to improve the integration with the ASR back-end. These yield a best result of 13.76% average word error rate, which is, to our knowledge, the best score to date.
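To illustrate the general idea of mask-based LSTM speech enhancement described in the abstract, the sketch below shows a single-layer LSTM that estimates a time-frequency mask from noisy magnitude spectra, applies it to obtain an enhanced magnitude estimate, and scores it with a signal-approximation style squared-error objective. This is a minimal NumPy illustration of the technique, not the authors' implementation; all shapes, parameter names, and the single-layer architecture are assumptions for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """One LSTM time step; gate pre-activations stacked as [i, f, o, g]."""
    z = W @ x + U @ h + b
    n = h.size
    i, f, o = sigmoid(z[:n]), sigmoid(z[n:2*n]), sigmoid(z[2*n:3*n])
    g = np.tanh(z[3*n:])
    c = f * c + i * g          # update cell state
    h = o * np.tanh(c)         # emit hidden state
    return h, c

def enhance(noisy_mag, params):
    """Estimate a [0, 1] mask frame by frame and apply it to the
    noisy magnitude spectrogram (frames along axis 0)."""
    W, U, b, Wo, bo = params   # hypothetical parameter layout
    n_hidden = b.size // 4
    h = np.zeros(n_hidden)
    c = np.zeros(n_hidden)
    out = np.empty_like(noisy_mag)
    for t, x in enumerate(noisy_mag):
        h, c = lstm_step(x, h, c, W, U, b)
        mask = sigmoid(Wo @ h + bo)   # per-frequency mask in [0, 1]
        out[t] = mask * x             # masked magnitude estimate
    return out

def sa_loss(enhanced, clean_mag):
    """Signal-approximation objective: squared error between the
    masked noisy magnitudes and the clean magnitudes."""
    return np.mean((enhanced - clean_mag) ** 2)
```

In training, the gradient of `sa_loss` with respect to the LSTM parameters would be backpropagated through time, so the network learns masks that directly minimize the reconstruction error of the clean speech rather than a mask-approximation proxy.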