TR2015-099

Uncertainty training and decoding methods of deep neural networks based on stochastic representation of enhanced features


    •  Tachioka, Y., Watanabe, S., "Uncertainty Training and Decoding Methods of Deep Neural Networks Based on Stochastic Representation of Enhanced Features", Interspeech, September 2015, vol. 1 or 5, pp. 3541.
      @inproceedings{Tachioka2015sep,
        author = {Tachioka, Y. and Watanabe, S.},
        title = {Uncertainty Training and Decoding Methods of Deep Neural Networks Based on Stochastic Representation of Enhanced Features},
        booktitle = {Interspeech},
        year = 2015,
        volume = {1 or 5},
        pages = 3541,
        month = sep,
        isbn = {978-1-5108-1790-6},
        url = {https://www.merl.com/publications/TR2015-099}
      }
  • Research Areas: Artificial Intelligence, Speech & Audio

Abstract:

Speech enhancement is an important front-end technique for improving automatic speech recognition (ASR) in noisy environments. However, erroneous noise suppression by speech enhancement often introduces additional distortions into the speech signals, which degrades ASR performance. To compensate for these distortions, ASR needs to consider the uncertainty of the enhanced features, which can be achieved by taking the expectation of the ASR decoding/training process with respect to a probabilistic representation of the input features. However, unlike the Gaussian mixture model, a deep neural network (DNN) cannot evaluate this expectation analytically due to its nonlinear activations. This paper proposes efficient Monte-Carlo approximation methods for this expectation calculation to realize DNN-based uncertainty decoding and training. It first models the uncertainty of the input features as a linear interpolation between the original and enhanced feature vectors with a random interpolation coefficient. By sampling input features based on this stochastic process during training, the DNN learns to generalize over the variations of the enhanced features. Our method also samples input features during decoding and integrates the multiple recognition hypotheses obtained from the samples. Experiments on reverberant noisy speech recognition tasks (the second CHiME and REVERB challenges) show the effectiveness of our techniques.
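The stochastic representation described in the abstract can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation: the uniform distribution for the interpolation coefficient, the toy softmax "DNN", and posterior averaging as a stand-in for full hypothesis integration are all simplifying assumptions.

```python
import numpy as np

def sample_uncertain_features(original, enhanced, rng):
    """Stochastic representation of enhanced features: a linear
    interpolation between the original (noisy) and enhanced feature
    vectors with a random coefficient alpha (assumed uniform here)."""
    alpha = rng.uniform(0.0, 1.0)  # random interpolation coefficient
    return alpha * enhanced + (1.0 - alpha) * original

def uncertainty_decode(posterior_fn, original, enhanced, n_samples, rng):
    """Monte-Carlo uncertainty decoding: run the acoustic model on
    several sampled feature vectors and combine the results (here
    simplified to averaging the posterior distributions)."""
    posts = [posterior_fn(sample_uncertain_features(original, enhanced, rng))
             for _ in range(n_samples)]
    return np.mean(posts, axis=0)

rng = np.random.default_rng(0)

# Toy stand-in for a DNN acoustic model: softmax of a fixed linear map.
W = rng.normal(size=(3, 4))
def toy_posterior(x):
    z = W @ x
    e = np.exp(z - z.max())
    return e / e.sum()

noisy = np.array([1.0, 0.5, -0.2, 0.3])     # original feature vector
enhanced = np.array([0.8, 0.6, -0.1, 0.2])  # enhanced feature vector
p = uncertainty_decode(toy_posterior, noisy, enhanced, n_samples=20, rng=rng)
```

For uncertainty training, the same `sample_uncertain_features` draw would be applied to each minibatch example so the DNN sees a fresh interpolation at every epoch; `p` above is a valid averaged posterior because a mean of probability vectors still sums to one.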