TR2014-033

The MERL/MELCO/TUM System for the REVERB Challenge Using Deep Recurrent Neural Network Feature Enhancement


    •  Weninger, F., Watanabe, S., Le Roux, J., Hershey, J.R., Tachioka, Y., Geiger, J., Schuller, B., Rigoll, G., "The MERL/MELCO/TUM System for the REVERB Challenge Using Deep Recurrent Neural Network Feature Enhancement", IEEE REVERB Workshop, May 2014.
      BibTeX TR2014-033 PDF
      • @inproceedings{Weninger2014may2,
      • author = {Weninger, F. and Watanabe, S. and {Le Roux}, J. and Hershey, J.R. and Tachioka, Y. and Geiger, J. and Schuller, B. and Rigoll, G.},
      • title = {The MERL/MELCO/TUM System for the REVERB Challenge Using Deep Recurrent Neural Network Feature Enhancement},
      • booktitle = {IEEE REVERB Workshop},
      • year = 2014,
      • month = may,
      • url = {https://www.merl.com/publications/TR2014-033}
      • }
  • MERL Contact:
  • Research Areas:

    Artificial Intelligence, Speech & Audio

TR Image
Visualization of the k-th cell in the n-th layer of an LSTM- RNN. Arrows denote data flow and 1 denotes a delay of one timestep.
Abstract:

This paper describes our joint submission to the REVERB Challenge, which calls for automatic speech recognition systems which are robust against varying room acoustics. Our approach uses deep recurrent neural network (DRNN) based feature enhancement in the log spectral domain as a single-channel front-end. The system is generalized to multi-channel audio by performing single-channel feature enhancement on the output of a sum-and-delay beamformer with direction of arrival estimation. On the back-end side, we employ a state-of-the-art speech recognizer using feature transformations, utterance based adaptation, and discriminative training. Results on the REVERB data indicate that the proposed front-end provides acceptable results already with a simple clean trained recognizer while being complementary to the improved back-end. The proposed ASR system with eight-channel input and feature enhancement achieves average word error rates (WERs) of 7.75 % and 20.09 % on the simulated and real evaluation sets, which is a drastic improvement over the Challenge baseline (25.26 and 49.16 %). Further improvements can be obtained by system combination with a DRNN tandem recognizer, reaching 7.02 % and 19.61 % WER.