Speech & Audio

Audio source separation, recognition, and understanding.

Our current research focuses on the application of machine learning to estimation and inference problems in speech and audio processing. Topics include end-to-end speech recognition and enhancement, acoustic modeling and analysis, statistical dialog systems, as well as natural language understanding and adaptive multimodal interfaces.

  • Researchers

  • Awards

    •  AWARD   Best Paper Award at the IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) 2019
      Date: December 18, 2019
      Awarded to: Xuankai Chang, Wangyou Zhang, Yanmin Qian, Jonathan Le Roux, Shinji Watanabe
      MERL Contact: Jonathan Le Roux
      Research Areas: Artificial Intelligence, Machine Learning, Speech & Audio
      Brief
      • MERL researcher Jonathan Le Roux and co-authors Xuankai Chang, Shinji Watanabe (Johns Hopkins University), Wangyou Zhang, and Yanmin Qian (Shanghai Jiao Tong University) won the Best Paper Award at the 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2019) for the paper "MIMO-Speech: End-to-End Multi-Channel Multi-Speaker Speech Recognition". MIMO-Speech is a fully neural end-to-end framework that can transcribe the text of multiple speakers speaking simultaneously from multi-channel input. The system is composed of a monaural masking network, a multi-source neural beamformer, and a multi-output speech recognition model, which are jointly optimized only via an automatic speech recognition (ASR) criterion. The award was received by lead author Xuankai Chang during the conference, which was held in Sentosa, Singapore from December 14-18, 2019.
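
      For readers curious how the pieces fit together, below is a minimal PyTorch-style sketch of the three-stage pipeline described above. It is an illustration only: the module sizes are arbitrary, the beamformer is reduced to mask-weighted channel averaging (the paper uses a neural beamformer), and none of it reflects the authors' actual implementation.

        # Hedged sketch of the MIMO-Speech pipeline: a monaural masking
        # network, a simplified stand-in for the multi-source neural
        # beamformer, and one recognition head per separated speaker,
        # trained end-to-end from an ASR loss alone. All names and
        # shapes are illustrative assumptions.
        import torch
        import torch.nn as nn

        class MimoSpeechSketch(nn.Module):
            def __init__(self, n_freq=257, n_srcs=2, hidden=600, vocab=50):
                super().__init__()
                # Monaural masking network: one T-F mask per source, per channel.
                self.mask_net = nn.Sequential(
                    nn.Linear(n_freq, hidden), nn.ReLU(),
                    nn.Linear(hidden, n_srcs * n_freq), nn.Sigmoid(),
                )
                # Multi-output recognizer: shared encoder, one head per source.
                self.encoder = nn.LSTM(n_freq, hidden, batch_first=True)
                self.heads = nn.ModuleList(
                    nn.Linear(hidden, vocab) for _ in range(n_srcs))
                self.n_srcs = n_srcs

            def forward(self, mag):  # mag: (batch, chans, time, freq) magnitudes
                b, c, t, f = mag.shape
                masks = self.mask_net(mag).view(b, c, t, self.n_srcs, f)
                # Beamforming stand-in: mask each channel, average channels.
                sep = (masks * mag.unsqueeze(3)).mean(dim=1)  # (b, t, srcs, f)
                logits = []
                for s, head in enumerate(self.heads):
                    enc, _ = self.encoder(sep[:, :, s, :])
                    logits.append(head(enc))  # an ASR (e.g. CTC) loss attaches here
                return logits  # train jointly, resolving speaker permutation

        model = MimoSpeechSketch()
        out = model(torch.randn(4, 6, 100, 257).abs())  # 4 utts, 6 mics, 100 frames
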
    •  AWARD   Best Student Paper Award at IEEE ICASSP 2018
      Date: April 17, 2018
      Awarded to: Zhong-Qiu Wang
      MERL Contact: Jonathan Le Roux
      Research Area: Speech & Audio
      Brief
      • Former MERL intern Zhong-Qiu Wang (Ph.D. candidate at Ohio State University) has received a Best Student Paper Award at the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2018) for the paper "Multi-Channel Deep Clustering: Discriminative Spectral and Spatial Embeddings for Speaker-Independent Speech Separation" by Zhong-Qiu Wang, Jonathan Le Roux, and John Hershey. The paper presents work performed during Zhong-Qiu's internship at MERL in the summer of 2017, extending MERL's pioneering Deep Clustering framework for speech separation to a multi-channel setup. The award was received on behalf of Zhong-Qiu by MERL researcher and co-author Jonathan Le Roux during the conference, held in Calgary, April 15-20.
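
      As background for the paper, the deep clustering recipe fits in a few lines: a network maps every time-frequency bin to a unit-norm embedding and is trained so that bins dominated by the same speaker embed close together; at test time, k-means on the embeddings yields separation masks. The sketch below is a hedged illustration under assumed shapes, with spectral input only; the paper's multi-channel extension additionally feeds spatial features such as inter-channel phase differences. It is not the authors' code.

        # Illustrative deep clustering model and objective; all sizes assumed.
        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class DeepClusteringSketch(nn.Module):
            def __init__(self, n_feats=257, n_freq=257, emb_dim=20, hidden=300):
                super().__init__()
                # BLSTM over frames; in the multi-channel variant, n_feats grows
                # to include spatial features while n_freq stays fixed.
                self.blstm = nn.LSTM(n_feats, hidden, num_layers=2,
                                     batch_first=True, bidirectional=True)
                self.proj = nn.Linear(2 * hidden, n_freq * emb_dim)
                self.emb_dim = emb_dim

            def forward(self, feats):             # feats: (batch, time, n_feats)
                h, _ = self.blstm(feats)          # (batch, time, 2*hidden)
                v = self.proj(h)                  # (batch, time, n_freq*emb_dim)
                v = v.view(feats.shape[0], -1, self.emb_dim)  # (batch, T*F, D)
                return F.normalize(v, dim=-1)     # unit-norm embeddings

        def dc_loss(v, y):
            # ||VV^T - YY^T||_F^2 expanded as
            # ||V^T V||^2 - 2 ||V^T Y||^2 + ||Y^T Y||^2, so the large
            # (TF x TF) affinity matrices are never formed explicitly.
            # v: (batch, TF, D) embeddings; y: (batch, TF, S) one-hot labels.
            vtv = torch.einsum('bnd,bne->bde', v, v)
            vty = torch.einsum('bnd,bns->bds', v, y)
            yty = torch.einsum('bns,bnt->bst', y, y)
            return vtv.pow(2).sum() - 2 * vty.pow(2).sum() + yty.pow(2).sum()
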
    •  AWARD   MERL's Speech Team Achieves World's 2nd Best Performance at the Third CHiME Speech Separation and Recognition Challenge
      Date: December 15, 2015
      Awarded to: John R. Hershey, Takaaki Hori, Jonathan Le Roux and Shinji Watanabe
      MERL Contacts: Takaaki Hori; Jonathan Le Roux
      Research Area: Speech & Audio
      Brief
      • The results of the third 'CHiME' Speech Separation and Recognition Challenge were publicly announced on December 15 at the IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2015) held in Scottsdale, Arizona, USA. MERL's Speech and Audio Team, in collaboration with SRI, ranked 2nd out of 26 teams from Europe, Asia and the US. The task this year was to recognize speech recorded using a tablet in real environments such as cafes, buses, or busy streets. Due to the high levels of noise and the distance from the speaker's mouth to the microphones, this is a very challenging task: the baseline system achieved only a 33.4% word error rate. The MERL/SRI system featured state-of-the-art techniques including a multi-channel front-end, noise-robust feature extraction, and deep learning for speech enhancement, acoustic modeling, and language modeling, leading to a dramatic 73% relative reduction in word error rate, down to 9.1%. The core of the system has since been released as a new official challenge baseline for the community to use.
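
      A note on the numbers: the 73% figure is a relative reduction, and it follows directly from the two word error rates quoted above (a quick Python check, for illustration):

        baseline_wer, system_wer = 33.4, 9.1  # % WER: challenge baseline vs MERL/SRI system
        relative_reduction = (baseline_wer - system_wer) / baseline_wer
        print(f"{relative_reduction:.1%}")    # 72.8%, reported as ~73%
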

    See All Awards for Speech & Audio
  • News & Events


    See All News & Events for Speech & Audio
  • Research Highlights

  • Internships

    • SA1358: Multimodal AI

      MERL is looking for an intern to work on fundamental research in audiovisual semantic understanding for scene-aware dialog, combining end-to-end dialog technologies with video scene understanding. The intern will collaborate with MERL researchers to derive and implement new models, conduct experiments, and prepare results for high-impact publication. The ideal candidate is a senior Ph.D. student with experience in one or more of video captioning/description, end-to-end conversation modeling, and natural language processing, along with practical machine learning and related programming skills. The internship is expected to last 3-6 months.


    See All Internships for Speech & Audio
  • Recent Publications

    •  Chang, X., Zhang, W., Qian, Y., Le Roux, J., Watanabe, S., "MIMO-Speech: End-to-End Multi-Channel Multi-Speaker Speech Recognition", IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), ISBN: 978-1-7281-0305-1, December 2019, pp. 237-244.
      BibTeX: TR2019-157
      • @inproceedings{Chang2019dec,
      • author = {Chang, Xuankai and Zhang, Wangyou and Qian, Yanmin and Le Roux, Jonathan and Watanabe, Shinji},
      • title = {MIMO-Speech: End-to-End Multi-Channel Multi-Speaker Speech Recognition},
      • booktitle = {IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU)},
      • year = 2019,
      • pages = {237--244},
      • month = dec,
      • isbn = {978-1-7281-0305-1},
      • url = {https://www.merl.com/publications/TR2019-157}
      • }
    •  Karita, S., Chen, N., Hayashi, T., Hori, T., Inaguma, H., Jiang, Z., Someki, M., Enrique Yalta Soplin, N., Yamamoto, R., Wang, X., Watanabe, S., Yoshimura, T., Zhang, W., "A Comparative Study on Transformer vs RNN in Speech Applications", IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), December 2019.
      BibTeX: TR2019-158
      • @inproceedings{Karita2019dec,
      • author = {Karita, Shigeki and Chen, Nanxin and Hayashi, Tomoki and Hori, Takaaki and Inaguma, Hirofumi and Jiang, Ziyan and Someki, Masao and Enrique Yalta Soplin, Nelson and Yamamoto, Ryuichi and Wang, Xiaofei and Watanabe, Shinji and Yoshimura, Takenori and Zhang, Wangyou},
      • title = {A Comparative Study on Transformer vs RNN in Speech Applications},
      • booktitle = {IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU)},
      • year = 2019,
      • month = dec,
      • url = {https://www.merl.com/publications/TR2019-158}
      • }
    •  Moritz, N., Hori, T., Le Roux, J., "Streaming End-to-End Speech Recognition with Joint CTC-Attention Based Models", IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), ISBN: 978-1-7281-0305-1, December 2019, pp. 936-943.
      BibTeX: TR2019-159
      • @inproceedings{Moritz2019dec,
      • author = {Moritz, Niko and Hori, Takaaki and Le Roux, Jonathan},
      • title = {Streaming End-to-End Speech Recognition with Joint CTC-Attention Based Models},
      • booktitle = {IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU)},
      • year = 2019,
      • pages = {936--943},
      • month = dec,
      • isbn = {978-1-7281-0305-1},
      • url = {https://www.merl.com/publications/TR2019-159}
      • }
    •  Kavalerov, I., Wisdom, S., Erdogan, H., Patton, B., Wilson, K., Le Roux, J., Hershey, J., "Universal Sound Separation", IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), DOI: 10.1109/WASPAA.2019.8937253, ISSN: 1947-1629, ISBN: 978-1-7281-1123-0, October 2019, pp. 170-174.
      BibTeX: TR2019-123
      • @inproceedings{Kavalerov2019oct,
      • author = {Kavalerov, Ilya and Wisdom, Scott and Erdogan, Hakan and Patton, Brian and Wilson, Kevin and Le Roux, Jonathan and Hershey, John},
      • title = {Universal Sound Separation},
      • booktitle = {IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)},
      • year = 2019,
      • pages = {170--174},
      • month = oct,
      • doi = {10.1109/WASPAA.2019.8937253},
      • issn = {1947-1629},
      • isbn = {978-1-7281-1123-0},
      • url = {https://www.merl.com/publications/TR2019-123}
      • }
    •  Manilow, E., Wichern, G., Seetharaman, P., Le Roux, J., "Cutting Music Source Separation Some Slakh: A Dataset to Study the Impact of Training Data Quality and Quantity", IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), DOI: 10.1109/WASPAA.2019.8937170, ISSN: 1947-1629, ISBN: 978-1-7281-1123-0, October 2019, pp. 45-49.
      BibTeX: TR2019-124
      • @inproceedings{Manilow2019oct,
      • author = {Manilow, Ethan and Wichern, Gordon and Seetharaman, Prem and Le Roux, Jonathan},
      • title = {Cutting Music Source Separation Some Slakh: A Dataset to Study the Impact of Training Data Quality and Quantity},
      • booktitle = {IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)},
      • year = 2019,
      • pages = {45--49},
      • month = oct,
      • doi = {10.1109/WASPAA.2019.8937170},
      • issn = {1947-1629},
      • isbn = {978-1-7281-1123-0},
      • url = {https://www.merl.com/publications/TR2019-124}
      • }
    •  Baskar, M.K., Watanabe, S., Astudillo, R., Hori, T., Burget, L., Cernocky, J.H., "Semi-supervised Sequence-to-sequence ASR using Unpaired Speech and Text", Interspeech, September 2019.
      BibTeX: TR2019-100
      • @inproceedings{Baskar2019sep,
      • author = {Baskar, Murali Karthick and Watanabe, Shinji and Astudillo, Ramon and Hori, Takaaki and Burget, Lukas and Cernocky, Jan Honza},
      • title = {Semi-supervised Sequence-to-sequence ASR using Unpaired Speech and Text},
      • booktitle = {Interspeech},
      • year = 2019,
      • month = sep,
      • url = {https://www.merl.com/publications/TR2019-100}
      • }
    •  Hori, C., Cherian, A., Marks, T., Hori, T., "Joint Student-Teacher Learning for Audio-Visual Scene-Aware Dialog", Interspeech, September 2019, pp. 1886-1890.
      BibTeX: TR2019-097
      • @inproceedings{Hori2019sep,
      • author = {Hori, Chiori and Cherian, Anoop and Marks, Tim and Hori, Takaaki},
      • title = {Joint Student-Teacher Learning for Audio-Visual Scene-Aware Dialog},
      • booktitle = {Interspeech},
      • year = 2019,
      • pages = {1886--1890},
      • month = sep,
      • publisher = {ISCA},
      • url = {https://www.merl.com/publications/TR2019-097}
      • }
    •  Karafiat, M., Baskar, M.K., Watanabe, S., Hori, T., Wiesner, M., Cernocky, J.H., "Analysis of Multilingual Sequence-to-Sequence Speech Recognition Systems", Interspeech, September 2019.
      BibTeX: TR2019-103
      • @inproceedings{Karafiat2019sep,
      • author = {Karafiat, Martin and Baskar, Murali Karthick and Watanabe, Shinji and Hori, Takaaki and Wiesner, Matthew and Cernocky, Jan Honza},
      • title = {Analysis of Multilingual Sequence-to-Sequence Speech Recognition Systems},
      • booktitle = {Interspeech},
      • year = 2019,
      • month = sep,
      • url = {https://www.merl.com/publications/TR2019-103}
      • }
    See All Publications for Speech & Audio
  • Videos

  • Free Downloads