Speech & Audio

Audio source separation, recognition, and understanding.

Our current research focuses on the application of machine learning to estimation and inference problems in speech and audio processing. Topics include end-to-end speech recognition and enhancement, acoustic modeling and analysis, statistical dialog systems, natural language understanding, and adaptive multimodal interfaces.

  • Researchers

  • Awards

    •  AWARD   Best Student Paper Award at IEEE ICASSP 2018
      Date: April 17, 2018
      Awarded to: Zhong-Qiu Wang
      MERL Contact: Jonathan Le Roux
      Research Area: Speech & Audio
      Brief
      • Former MERL intern Zhong-Qiu Wang (Ph.D. candidate at Ohio State University) has received a Best Student Paper Award at the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2018) for the paper "Multi-Channel Deep Clustering: Discriminative Spectral and Spatial Embeddings for Speaker-Independent Speech Separation" by Zhong-Qiu Wang, Jonathan Le Roux, and John Hershey. The paper presents work performed during Zhong-Qiu's internship at MERL in the summer of 2017, extending MERL's pioneering Deep Clustering framework for speech separation to a multi-channel setup (a sketch of the basic single-channel pipeline appears after this list). The award was accepted on behalf of Zhong-Qiu by MERL researcher and co-author Jonathan Le Roux during the conference, held in Calgary on April 15-20.
    •  AWARD   MERL's Speech Team Achieves World's 2nd Best Performance at the Third CHiME Speech Separation and Recognition Challenge
      Date: December 15, 2015
      Awarded to: MERL Speech and Audio Team, in collaboration with SRI
      MERL Contacts: Takaaki Hori; Jonathan Le Roux
      Research Area: Speech & Audio
      Brief
      • The results of the third 'CHiME' Speech Separation and Recognition Challenge were publicly announced on December 15 at the IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2015), held in Scottsdale, Arizona, USA. MERL's Speech and Audio Team, in collaboration with SRI, ranked 2nd out of 26 teams from Europe, Asia, and the US. The task this year was to recognize speech recorded using a tablet in real environments such as cafes, buses, or busy streets. Due to the high levels of noise and the distance from the speaker's mouth to the microphones, this is a very challenging task, on which the baseline system achieved a word error rate of only 33.4%. The MERL/SRI system featured state-of-the-art techniques including a multi-channel front-end, noise-robust feature extraction, and deep learning for speech enhancement, acoustic modeling, and language modeling, leading to a dramatic 73% relative reduction in word error rate, down to 9.1%. The core of the system has since been released as a new official challenge baseline for the community to use.
    •  AWARD   Awaya Prize Young Researcher Award
      Date: March 11, 2014
      Awarded to: Yuuki Tachioka
      MERL Contact: Jonathan Le Roux
      Research Area: Speech & Audio
      Brief
      • MELCO researcher Yuuki Tachioka received the Awaya Prize Young Researcher Award from the Acoustical Society of Japan (ASJ) for "effectiveness of discriminative approaches for speech recognition under noisy environments on the 2nd CHiME Challenge", which was based on joint work with MERL Speech & Audio team researchers Shinji Watanabe, Jonathan Le Roux, and John R. Hershey.

    See All Awards for Speech & Audio
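
    As referenced in the ICASSP 2018 Best Student Paper brief above, the following is a minimal sketch of the separation stage of single-channel deep clustering, assuming a trained network has already produced an embedding vector for every time-frequency bin of the mixture spectrogram. The array names, shapes, and the random stand-in for the network output are illustrative only, not MERL's implementation.

        # Separation stage of deep clustering (single-channel sketch).
        # A trained embedding network would supply `embeddings`; here we
        # substitute random values so the script runs stand-alone.
        import numpy as np
        from sklearn.cluster import KMeans

        T, F, D = 100, 129, 20    # time frames, frequency bins, embedding dimension
        rng = np.random.default_rng(0)
        mixture = rng.standard_normal((T, F)) + 1j * rng.standard_normal((T, F))  # mixture STFT
        embeddings = rng.standard_normal((T * F, D))  # stand-in for the network output

        # Cluster the time-frequency embeddings; each cluster corresponds to one speaker.
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)

        # Turn cluster assignments into binary masks and apply them to the mixture.
        masks = [(labels == k).reshape(T, F) for k in range(2)]
        sources = [mask * mixture for mask in masks]  # masked STFTs; invert the STFT for waveforms

    Roughly speaking, the awarded multi-channel extension enriches the network's input with inter-channel (spatial) features, while the clustering-and-masking stage above is unchanged.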
  • News & Events

    •  NEWS   MERL presenting 9 papers at ICASSP 2018
      Date: April 15, 2018 - April 20, 2018
      Where: Calgary, AB
      MERL Contacts: Petros Boufounos; Takaaki Hori; Toshiaki Koike-Akino; Jonathan Le Roux; Dehong Liu; Hassan Mansour; Philip Orlik; Pu (Perry) Wang
      Research Areas: Computational Sensing, Digital Video, Speech & Audio
      Brief
      • MERL researchers are presenting 9 papers at the IEEE International Conference on Acoustics, Speech & Signal Processing (ICASSP), which is being held in Calgary from April 15-20, 2018. Topics to be presented include recent advances in speech recognition, audio processing, and computational sensing. MERL is also a sponsor of the conference.

        ICASSP is the flagship conference of the IEEE Signal Processing Society, and the world's largest and most comprehensive technical conference focused on research advances and the latest technological developments in signal and information processing. The event attracts more than 2,000 participants each year.
    •  TALK   Theory and Applications of Sparse Model-Based Recurrent Neural Networks
      Date & Time: Tuesday, March 6, 2018; 12:00 PM
      Speaker: Scott Wisdom, Affectiva
      MERL Host: Jonathan Le Roux
      Research Area: Speech & Audio
      Brief
      • Recurrent neural networks (RNNs) are effective, data-driven models for sequential data, such as audio and speech signals. However, like many deep networks, RNNs are essentially black boxes; though they are effective, their weights and architecture are not directly interpretable by practitioners. A major component of my dissertation research is explaining the success of RNNs and constructing new RNN architectures through the process of "deep unfolding," which can construct and explain deep network architectures using an equivalence to inference in statistical models. Deep unfolding yields principled initializations for training deep networks, provides insight into their effectiveness, and assists with interpretation of what these networks learn.

        In particular, I will show how RNNs with rectified linear units and residual connections are a particular deep unfolding of a sequential version of the iterative shrinkage-thresholding algorithm (ISTA), a simple and classic algorithm for solving L1-regularized least-squares. This equivalence allows interpretation of state-of-the-art unitary RNNs (uRNNs) as an unfolded sparse coding algorithm. I will also describe a new type of RNN architecture called deep recurrent nonnegative matrix factorization (DR-NMF). DR-NMF is an unfolding of a sparse NMF model of nonnegative spectrograms for audio source separation. Both of these networks outperform conventional LSTM networks while also providing interpretability for practitioners. (A toy example of unfolding ISTA into ReLU layers appears after this list.)

    See All News & Events for Speech & Audio
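
    To make the unfolding idea from the talk above concrete, here is a toy ISTA loop for L1-regularized least-squares, written so that each iteration reads as one layer of a network: a fixed linear map followed by a soft threshold, itself expressible with ReLUs. The problem instance and all variable names are illustrative assumptions, not material from the talk.

        # Toy ISTA for  min_x 0.5*||A x - y||^2 + lam*||x||_1,
        # arranged so each iteration looks like one unfolded network layer.
        import numpy as np

        rng = np.random.default_rng(0)
        m, n, k = 40, 100, 5
        A = rng.standard_normal((m, n)) / np.sqrt(m)
        x_true = np.zeros(n)
        x_true[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
        y = A @ x_true

        lam = 0.05
        L = np.linalg.norm(A, 2) ** 2   # Lipschitz constant of the quadratic term's gradient
        W = np.eye(n) - (A.T @ A) / L   # fixed "recurrent" weight matrix of each layer
        b = (A.T @ y) / L               # fixed input injection, shared by all layers

        def soft_threshold(z, t):
            # Soft threshold sign(z)*max(|z|-t, 0), written with two ReLUs.
            return np.maximum(z - t, 0.0) - np.maximum(-z - t, 0.0)

        x = np.zeros(n)
        for _ in range(200):            # 200 "layers" of the unfolded network
            x = soft_threshold(W @ x + b, lam / L)

        print("recovered support:", np.nonzero(np.abs(x) > 1e-3)[0])

    Deep unfolding would then untie W, b, and the threshold across layers and train them by backpropagation; this is the sense in which ReLU RNNs with residual connections can be read as an unfolded sparse coding algorithm.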
  • Research Highlights

  • Internships

    • SA1132: End-to-end acoustic analysis, recognition, and inference

      MERL is looking for an intern to work on fundamental research in the area of end-to-end acoustic analysis, recognition, and inference using machine learning techniques such as deep learning. The intern will collaborate with MERL researchers to derive and implement new models and optimization methods, conduct experiments, and prepare results for high-impact publication. The ideal candidate would be a senior Ph.D. student with experience in one or more of source separation, speech recognition, and natural language processing, including practical machine learning algorithms and the related programming skills. The duration of the internship is expected to be 3-6 months.


    See All Internships for Speech & Audio
  • Recent Publications

    •  Seki, H., Hori, T., Watanabe, S., Le Roux, J., Hershey, J., "A Purely End-to-end System for Multi-speaker Speech Recognition", Annual Meeting of the Association for Computational Linguistics (ACL), Jul 16, 2018.
    •  Hori, C., Alamri, H., Wang, J., Wichern, G., Hori, T., Cherian, A., Marks, T.K., Cartillier, V., Lopes, R., Das, A., Essa, I., Batra, D., Parikh, D., "End-to-End Audio Visual Scene-Aware Dialog using Multimodal Attention-Based Video Features", arXiv, July 13, 2018.
      BibTeX (TR2018-085):
        @techreport{MERL_TR2018-085,
          author = {Hori, C. and Alamri, H. and Wang, J. and Wichern, G. and Hori, T. and Cherian, A. and Marks, T.K. and Cartillier, V. and Lopes, R. and Das, A. and Essa, I. and Batra, D. and Parikh, D.},
          title = {End-to-End Audio Visual Scene-Aware Dialog using Multimodal Attention-Based Video Features},
          institution = {MERL - Mitsubishi Electric Research Laboratories},
          address = {Cambridge, MA 02139},
          number = {TR2018-085},
          month = jul,
          year = 2018,
          url = {http://www.merl.com/publications/TR2018-085/}
        }
    •  Alamri, H., Cartillier, V., Lopes, R., Das, A., Wang, J., Essa, I., Batra, D., Parikh, D., Cherian, A., Marks, T.K., Hori, C., "Audio Visual Scene-Aware Dialog (AVSD) Challenge at DSTC7", arXiv, July 12, 2018.
      BibTeX (TR2018-069):
        @techreport{MERL_TR2018-069,
          author = {Alamri, H. and Cartillier, V. and Lopes, R. and Das, A. and Wang, J. and Essa, I. and Batra, D. and Parikh, D. and Cherian, A. and Marks, T.K. and Hori, C.},
          title = {Audio Visual Scene-Aware Dialog (AVSD) Challenge at DSTC7},
          institution = {MERL - Mitsubishi Electric Research Laboratories},
          address = {Cambridge, MA 02139},
          number = {TR2018-069},
          month = jul,
          year = 2018,
          url = {http://www.merl.com/publications/TR2018-069/}
        }
    •  Seki, H., Hori, T., Watanabe, S., Le Roux, J., Hershey, J., "A Purely End-to-end System for Multi-speaker Speech Recognition", arXiv, July 10, 2018.
      BibTeX (TR2018-058):
        @techreport{MERL_TR2018-058,
          author = {Seki, H. and Hori, T. and Watanabe, S. and Le Roux, J. and Hershey, J.},
          title = {A Purely End-to-end System for Multi-speaker Speech Recognition},
          institution = {MERL - Mitsubishi Electric Research Laboratories},
          address = {Cambridge, MA 02139},
          number = {TR2018-058},
          month = jul,
          year = 2018,
          url = {http://www.merl.com/publications/TR2018-058/}
        }
    •  Wang, Z.-Q., Le Roux, J., Hershey, J., "End-to-End Speech Separation with Unfolded Iterative Phase Reconstruction", arXiv, July 9, 2018.
      BibTeX (TR2018-051):
        @techreport{MERL_TR2018-051,
          author = {Wang, Z.-Q. and Le Roux, J. and Hershey, J.},
          title = {End-to-End Speech Separation with Unfolded Iterative Phase Reconstruction},
          institution = {MERL - Mitsubishi Electric Research Laboratories},
          address = {Cambridge, MA 02139},
          number = {TR2018-051},
          month = jul,
          year = 2018,
          url = {http://www.merl.com/publications/TR2018-051/}
        }
    •  Ochiai, T., Watanabe, S., Katagiri, S., Hori, T., Hershey, J.R., "Speaker Adaptation for Multichannel End-to-End Speech Recognition", IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), April 2018.
    •  Seki, H., Watanabe, S., Hori, T., Le Roux, J., Hershey, J.R., "An End-to-End Language-Tracking Speech Recognizer for Mixed-Language Speech", IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), April 2018.
    •  Settle, S., Le Roux, J., Hori, T., Watanabe, S., Hershey, J.R., "End-to-End Multi-Speaker Speech Recognition", IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), April 2018.
    See All Publications for Speech & Audio
  • Videos

  • Free Downloads