TR2015-137

Automation of System Building for State-of-the-art Large Vocabulary Speech Recognition Using Evolution Strategy


    •  Moriya, T., Shinozaki, T., Watanabe, S., Duh, K., "Automation of System Building for State-of-the-Art Large Vocabulary Speech Recognition Using Evolution Strategy", IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), DOI: 10.1109/ASRU.2015.7404852, December 2015, pp. 610-616.
      @inproceedings{Moriya2015dec,
        author = {Moriya, T. and Shinozaki, T. and Watanabe, S. and Duh, K.},
        title = {Automation of System Building for State-of-the-Art Large Vocabulary Speech Recognition Using Evolution Strategy},
        booktitle = {IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU)},
        year = 2015,
        pages = {610--616},
        month = dec,
        doi = {10.1109/ASRU.2015.7404852},
        url = {https://www.merl.com/publications/TR2015-137}
      }
  • Research Areas:

    Artificial Intelligence, Speech & Audio

Abstract:

When building a state-of-the-art speech recognition system, a major challenge is the laborious effort required by human experts in tuning numerous parameters. The goal of this paper is to automate the process. We propose to use covariance matrix adaptation evolution strategy (CMA-ES), a meta-heuristic method known to work well on various black-box optimization problems. Further, we extend CMA-ES to perform multi-objective optimization, giving a high-accuracy speech recognition system with reasonable model size. We apply the proposed automation method to building GMM and DNN HMM-based systems with the Corpus of Spontaneous Japanese (CSJ), a widely used large-scale Japanese speech corpus. Experiments are performed using the TSUBAME 2.5 supercomputer, demonstrating the evolution of a large vocabulary speech recognition system. The optimized training code will be released in the Kaldi speech recognition toolkit as the first publicly available recipe for Japanese large vocabulary speech recognition.
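
To make the idea of evolutionary tuning with two objectives concrete, the following is a minimal sketch, not the method used in the paper. It assumes a much simpler Gaussian evolution strategy than CMA-ES (a per-parameter spread instead of a full covariance matrix, and no evolution paths), a hypothetical `toy_objectives` function standing in for training and scoring a recognizer, and a generic Pareto archive as one possible way to track the trade-off between error and model size. All function and parameter names are illustrative.

```python
import random


def toy_objectives(params):
    """Hypothetical stand-in for training and evaluating a recognizer.

    Returns (error, size); both values are to be minimized.
    """
    x, y = params
    error = (x - 3.0) ** 2 + (y + 1.0) ** 2  # pretend recognition error
    size = abs(x) + abs(y)                   # pretend model size
    return error, size


def dominates(a, b):
    """True if a is no worse than b in every objective and better in one."""
    return (all(p <= q for p, q in zip(a, b))
            and any(p < q for p, q in zip(a, b)))


def pareto_front(scored):
    """Keep the (params, objectives) pairs that no other pair dominates."""
    return [s for s in scored
            if not any(dominates(o[1], s[1]) for o in scored)]


def evolve(mean, sigma, generations=30, population=20, parents=5, seed=0):
    """Simplified Gaussian evolution strategy (assumption: diagonal spread).

    The search distribution is updated from the candidates with the lowest
    error, while a Pareto archive records error/size trade-offs seen so far.
    """
    rng = random.Random(seed)
    mean, sigma = list(mean), list(sigma)
    archive = []
    for _ in range(generations):
        # Sample candidate parameter vectors around the current mean.
        samples = [[rng.gauss(m, s) for m, s in zip(mean, sigma)]
                   for _ in range(population)]
        scored = [(p, toy_objectives(p)) for p in samples]
        archive = pareto_front(archive + scored)

        # Select the candidates with the lowest error as parents.
        scored.sort(key=lambda item: item[1][0])
        elite = [p for p, _ in scored[:parents]]
        columns = list(zip(*elite))

        # Re-estimate the spread around the OLD mean, then move the mean.
        sigma = [max(1e-3, (sum((v - m) ** 2 for v in c) / len(c)) ** 0.5)
                 for c, m in zip(columns, mean)]
        mean = [sum(c) / len(c) for c in columns]
    return mean, archive


if __name__ == "__main__":
    final_mean, front = evolve([0.0, 0.0], [1.0, 1.0])
    print("final mean:", final_mean)
    print("non-dominated candidates kept:", len(front))
```

In a real setting each call to the objective function would correspond to a full training and evaluation run, which is why such searches are typically distributed across many compute nodes; the toy function here only keeps the sketch self-contained.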