Speech & Audio

Audio source separation, recognition, and understanding.

Our current research focuses on the application of machine learning to estimation and inference problems in speech and audio processing. Topics include end-to-end speech recognition and enhancement, acoustic modeling and analysis, statistical dialog systems, natural language understanding, and adaptive multimodal interfaces.

  • Awards

    •  AWARD    Best Poster Award and Best Video Award at the International Society for Music Information Retrieval Conference (ISMIR) 2020
      Date: October 15, 2020
      Awarded to: Ethan Manilow, Gordon Wichern, Jonathan Le Roux
      MERL Contacts: Jonathan Le Roux; Gordon Wichern
      Research Areas: Artificial Intelligence, Machine Learning, Speech & Audio
      Brief
• Former MERL intern Ethan Manilow and MERL researchers Gordon Wichern and Jonathan Le Roux won the Best Poster Award and the Best Video Award at the 2020 International Society for Music Information Retrieval Conference (ISMIR 2020) for the paper "Hierarchical Musical Source Separation". The conference was held October 11-14 in a virtual format. Both awards were determined by popular vote among the conference attendees.

        The paper proposes a new method for isolating individual sounds in an audio mixture that accounts for the hierarchical relationship between sound sources. Many sounds we are interested in analyzing are hierarchical in nature, e.g., during a music performance, a hi-hat note is one of many such hi-hat notes, which is one of several parts of a drumkit, itself one of many instruments in a band, which might be playing in a bar with other sounds occurring. Inspired by this, the paper re-frames the audio source separation problem as hierarchical, combining similar sounds together at certain levels while separating them at other levels, and shows on a musical instrument separation task that a hierarchical approach outperforms non-hierarchical models while also requiring less training data. The paper, poster, and video can be seen on the paper page on the ISMIR website.
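The hierarchical framing described above can be sketched in a few lines. This is a toy illustration only (the source names, hierarchy, and signals are invented, not the paper's code): leaf-level sources sum into parent-level targets, so a single model could be supervised at every level of the tree.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy leaf-level "sources" (e.g., individual instruments), shape (time,)
leaves = {
    "hi_hat": rng.standard_normal(8),
    "snare": rng.standard_normal(8),
    "guitar": rng.standard_normal(8),
}

# Hypothetical hierarchy: each parent node is the sum of its children
hierarchy = {
    "drums": ["hi_hat", "snare"],
    "band": ["drums", "guitar"],
}

def target(node):
    """Recursively build the separation target for any node in the tree."""
    if node in leaves:
        return leaves[node]
    return sum(target(child) for child in hierarchy[node])

# The top of the hierarchy reconstructs the full mixture
mixture = target("band")
assert np.allclose(mixture, leaves["hi_hat"] + leaves["snare"] + leaves["guitar"])
```

Separating at the "drums" level groups hi-hat and snare together, while separating at the leaf level isolates each; the same tree thus defines targets at every granularity.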
    •  AWARD    Best Paper Award at the IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) 2019
      Date: December 18, 2019
      Awarded to: Xuankai Chang, Wangyou Zhang, Yanmin Qian, Jonathan Le Roux, Shinji Watanabe
      MERL Contact: Jonathan Le Roux
      Research Areas: Artificial Intelligence, Machine Learning, Speech & Audio
      Brief
• MERL researcher Jonathan Le Roux and co-authors Xuankai Chang, Shinji Watanabe (Johns Hopkins University), Wangyou Zhang, and Yanmin Qian (Shanghai Jiao Tong University) won the Best Paper Award at the 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2019), for the paper "MIMO-Speech: End-to-End Multi-Channel Multi-Speaker Speech Recognition". MIMO-Speech is a fully neural end-to-end framework that can transcribe the text of multiple speakers speaking simultaneously from multi-channel input. The system comprises a monaural masking network, a multi-source neural beamformer, and a multi-output speech recognition model, which are jointly optimized only via an automatic speech recognition (ASR) criterion. The award was received by lead author Xuankai Chang during the conference, which was held in Sentosa, Singapore from December 14-18, 2019.
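The three-stage data flow described above can be followed with a shape-level sketch. This is not the authors' code: each neural stage is replaced by a placeholder linear or mask operation so only the tensor shapes and the masking-network → beamformer → per-speaker ASR pipeline are illustrated; all dimensions are invented.

```python
import numpy as np

rng = np.random.default_rng(0)
C, T, F, S = 4, 10, 8, 2  # channels, frames, frequency bins, speakers

# Multi-channel magnitude spectrogram of an overlapped-speech mixture
X = rng.standard_normal((C, T, F))

# 1) Monaural masking network: per-speaker time-frequency masks in [0, 1]
#    (stand-in: sigmoid of random logits instead of a trained network)
logits = rng.standard_normal((S, T, F))
masks = 1.0 / (1.0 + np.exp(-logits))

# 2) "Beamformer": mask-weighted channel averaging as a stand-in for the
#    MVDR beamformer whose statistics are derived from the masks
beamformed = np.stack([masks[s] * X.mean(axis=0) for s in range(S)])  # (S, T, F)

# 3) Multi-output ASR front end: one encoder applied to each speaker stream
#    (stand-in: a shared linear projection instead of an ASR encoder)
W = rng.standard_normal((F, 16))
encoded = beamformed @ W  # (S, T, 16): one feature sequence per speaker
assert encoded.shape == (S, T, 16)
```

In the actual system all three stages are neural networks trained jointly, with gradients flowing from the ASR loss back through the beamformer into the masking network; no separation-level supervision is needed.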
    •  AWARD    Best Student Paper Award at IEEE ICASSP 2018
      Date: April 17, 2018
      Awarded to: Zhong-Qiu Wang
      MERL Contact: Jonathan Le Roux
      Research Area: Speech & Audio
      Brief
• Former MERL intern Zhong-Qiu Wang (Ph.D. Candidate at Ohio State University) has received a Best Student Paper Award at the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2018) for the paper "Multi-Channel Deep Clustering: Discriminative Spectral and Spatial Embeddings for Speaker-Independent Speech Separation" by Zhong-Qiu Wang, Jonathan Le Roux, and John Hershey. The paper presents work performed during Zhong-Qiu's internship at MERL in the summer of 2017, extending MERL's pioneering Deep Clustering framework for speech separation to a multi-channel setup. The award was received on behalf of Zhong-Qiu by MERL researcher and co-author Jonathan Le Roux during the conference, held in Calgary, April 15-20, 2018.
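The core idea of combining spectral and spatial cues in deep clustering can be sketched as follows. This is an illustrative toy, not the paper's method: the learned BLSTM embeddings are replaced by hand-built features (log magnitude plus inter-channel phase difference, IPD), and the time-frequency bins are clustered into speakers with a few k-means steps; all signals and shapes are invented.

```python
import numpy as np

rng = np.random.default_rng(1)
T, F = 6, 4  # frames, frequency bins

# Two-channel complex spectrogram of a toy mixture; channel 2 is a
# phase-shifted copy of channel 1 to give the bins spatial structure
X1 = rng.standard_normal((T, F)) + 1j * rng.standard_normal((T, F))
X2 = X1 * np.exp(1j * rng.uniform(-np.pi, np.pi, (T, F)))

log_mag = np.log(np.abs(X1) + 1e-8)  # spectral feature
ipd = np.angle(X2 * np.conj(X1))     # spatial feature (IPD)

# One feature vector per T-F bin; the real system maps such inputs through
# a BLSTM to embeddings trained with the deep clustering objective
emb = np.stack([log_mag.ravel(), np.cos(ipd).ravel(), np.sin(ipd).ravel()], axis=1)

# Cluster the bins into two "speakers" with a small k-means loop
centers = emb[rng.choice(len(emb), 2, replace=False)]
for _ in range(10):
    d = ((emb[:, None, :] - centers[None]) ** 2).sum(-1)
    labels = d.argmin(axis=1)
    for k in (0, 1):
        if np.any(labels == k):  # guard against an empty cluster
            centers[k] = emb[labels == k].mean(0)

# Each bin is now assigned to one speaker, defining a binary separation mask
assert labels.shape == (T * F,)
```

Using cos/sin of the IPD rather than the raw angle avoids the 2π wrap-around discontinuity, a common choice when feeding phase differences to a clustering or learning stage.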

  • Recent Publications

    •  Chang, X., Moritz, N., Hori, T., Watanabe, S., Le Roux, J., "Extended Graph Temporal Classification for Multi-Speaker End-to-End ASR", IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), April 2022.
      BibTeX TR2022-021 PDF
      @inproceedings{Chang2022apr,
        author = {Chang, Xuankai and Moritz, Niko and Hori, Takaaki and Watanabe, Shinji and Le Roux, Jonathan},
        title = {Extended Graph Temporal Classification for Multi-Speaker End-to-End ASR},
        booktitle = {IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)},
        year = 2022,
        month = apr,
        url = {https://www.merl.com/publications/TR2022-021}
      }
    •  Higuchi, Y., Moritz, N., Le Roux, J., Hori, T., "Advancing Momentum Pseudo-Labeling with Conformer and Initialization Strategy", IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), April 2022.
      BibTeX TR2022-026 PDF
      @inproceedings{Higuchi2022apr,
        author = {Higuchi, Yosuke and Moritz, Niko and Le Roux, Jonathan and Hori, Takaaki},
        title = {Advancing Momentum Pseudo-Labeling with Conformer and Initialization Strategy},
        booktitle = {IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)},
        year = 2022,
        month = apr,
        url = {https://www.merl.com/publications/TR2022-026}
      }
    •  Moritz, N., Hori, T., Watanabe, S., Le Roux, J., "Sequence Transduction with Graph-based Supervision", IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), April 2022.
      BibTeX TR2022-024 PDF
      @inproceedings{Moritz2022apr,
        author = {Moritz, Niko and Hori, Takaaki and Watanabe, Shinji and Le Roux, Jonathan},
        title = {Sequence Transduction with Graph-based Supervision},
        booktitle = {IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)},
        year = 2022,
        month = apr,
        url = {https://www.merl.com/publications/TR2022-024}
      }
    •  Petermann, D., Wichern, G., Wang, Z.-Q., Le Roux, J., "The Cocktail Fork Problem: Three-Stem Audio Separation for Real-World Soundtracks", IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), April 2022.
      BibTeX TR2022-022 PDF
      @inproceedings{Petermann2022apr,
        author = {Petermann, Darius and Wichern, Gordon and Wang, Zhong-Qiu and Le Roux, Jonathan},
        title = {The Cocktail Fork Problem: Three-Stem Audio Separation for Real-World Soundtracks},
        booktitle = {IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)},
        year = 2022,
        month = apr,
        url = {https://www.merl.com/publications/TR2022-022}
      }
    •  Shah, A.P., Geng, S., Gao, P., Cherian, A., Hori, T., Marks, T.K., Le Roux, J., Hori, C., "Audio-Visual Scene-Aware Dialog and Reasoning Using Audio-Visual Transformers with Joint Student-Teacher Learning", IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), April 2022.
      BibTeX TR2022-019 PDF
      @inproceedings{Shah2022apr,
        author = {Shah, Ankit Parag and Geng, Shijie and Gao, Peng and Cherian, Anoop and Hori, Takaaki and Marks, Tim K. and Le Roux, Jonathan and Hori, Chiori},
        title = {Audio-Visual Scene-Aware Dialog and Reasoning Using Audio-Visual Transformers with Joint Student-Teacher Learning},
        booktitle = {IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)},
        year = 2022,
        month = apr,
        url = {https://www.merl.com/publications/TR2022-019}
      }
    •  Slizovskaia, O., Wichern, G., Wang, Z.-Q., Le Roux, J., "Locate This, Not That: Class-Conditioned Sound Event DOA Estimation", IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), April 2022.
      BibTeX TR2022-023 PDF
      @inproceedings{Slizovskaia2022mar,
        author = {Slizovskaia, Olga and Wichern, Gordon and Wang, Zhong-Qiu and Le Roux, Jonathan},
        title = {Locate This, Not That: Class-Conditioned Sound Event DOA Estimation},
        booktitle = {IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)},
        year = 2022,
        month = apr,
        url = {https://www.merl.com/publications/TR2022-023}
      }
    •  Hori, C., Shah, A.P., Geng, S., Gao, P., Cherian, A., Hori, T., Le Roux, J., Marks, T.K., "Overview of Audio Visual Scene-Aware Dialog with Reasoning Track for Natural Language Generation in DSTC10", The 10th Dialog System Technology Challenge Workshop at AAAI, February 2022.
      BibTeX TR2022-016 PDF
      @inproceedings{Hori2022feb,
        author = {Hori, Chiori and Shah, Ankit Parag and Geng, Shijie and Gao, Peng and Cherian, Anoop and Hori, Takaaki and Le Roux, Jonathan and Marks, Tim K.},
        title = {Overview of Audio Visual Scene-Aware Dialog with Reasoning Track for Natural Language Generation in DSTC10},
        booktitle = {The 10th Dialog System Technology Challenge Workshop at AAAI},
        year = 2022,
        month = feb,
        url = {https://www.merl.com/publications/TR2022-016}
      }
    •  Shah, A.P., Hori, T., Le Roux, J., Hori, C., "DSTC10-AVSD Submission System with Reasoning using Audio-Visual Transformers with Joint Student-Teacher Learning", February 2022.
      BibTeX TR2022-025 PDF
      @book{Shah2022feb,
        author = {Shah, Ankit Parag and Hori, Takaaki and Le Roux, Jonathan and Hori, Chiori},
        title = {DSTC10-AVSD Submission System with Reasoning using Audio-Visual Transformers with Joint Student-Teacher Learning},
        year = 2022,
        month = feb,
        url = {https://www.merl.com/publications/TR2022-025}
      }