Speech & Audio

Audio source separation, recognition, and understanding.

Our current research focuses on the application of machine learning to estimation and inference problems in speech and audio processing. Topics include end-to-end speech recognition and enhancement, acoustic modeling and analysis, statistical dialog systems, natural language understanding, and adaptive multimodal interfaces.

  • Researchers

  • Awards

    •  AWARD   Best Poster Award and Best Video Award at the International Society for Music Information Retrieval Conference (ISMIR) 2020
      Date: October 15, 2020
      Awarded to: Ethan Manilow, Gordon Wichern, Jonathan Le Roux
      MERL Contacts: Jonathan Le Roux; Gordon Wichern
      Research Areas: Artificial Intelligence, Machine Learning, Speech & Audio
      Brief
      • Former MERL intern Ethan Manilow and MERL researchers Gordon Wichern and Jonathan Le Roux won the Best Poster Award and the Best Video Award at the 2020 International Society for Music Information Retrieval Conference (ISMIR 2020) for the paper "Hierarchical Musical Source Separation". The conference was held October 11-14 in a virtual format. Both awards were determined by popular vote among the conference attendees.

        The paper proposes a new method for isolating individual sounds in an audio mixture that accounts for the hierarchical relationship between sound sources. Many sounds we are interested in analyzing are hierarchical in nature: during a music performance, for example, a hi-hat note is one of many such notes, which together form one of several parts of a drum kit, itself one of many instruments in a band, which might be playing in a bar alongside other sounds. Inspired by this, the paper re-frames audio source separation as a hierarchical problem, combining similar sounds at certain levels of the hierarchy while separating them at others, and shows on a musical instrument separation task that a hierarchical approach outperforms non-hierarchical models while also requiring less training data. The paper, poster, and video can be found on the paper page on the ISMIR website. A minimal sketch of the hierarchical idea follows.
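
        As a rough illustration of the idea (a sketch only, not the authors' implementation; all module names and shapes below are hypothetical), a mask-based separator can share a single encoder and estimate one mask per hierarchy level, with each level's training target being the sum of the leaf sources under that node:

          import torch
          import torch.nn as nn

          class HierarchicalSeparator(nn.Module):
              """Hypothetical sketch: one mask head per level of a source
              hierarchy (e.g., hi-hat note -> drum kit -> accompaniment)."""
              def __init__(self, n_freq=513, hidden=256, n_levels=3):
                  super().__init__()
                  self.rnn = nn.LSTM(n_freq, hidden, num_layers=2,
                                     batch_first=True, bidirectional=True)
                  # One mask head per hierarchy level, sharing the same encoder.
                  self.heads = nn.ModuleList(
                      [nn.Linear(2 * hidden, n_freq) for _ in range(n_levels)])

              def forward(self, mix_mag):  # (batch, time, n_freq) mixture magnitude
                  feats, _ = self.rnn(mix_mag)
                  # Level-k estimate: the target source merged with its
                  # siblings up to level k of the hierarchy.
                  return [torch.sigmoid(head(feats)) * mix_mag for head in self.heads]

          def hierarchical_loss(estimates, targets):
              # targets[k] is the sum of the leaf sources under the level-k node.
              return sum(nn.functional.l1_loss(est, tgt)
                         for est, tgt in zip(estimates, targets))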
    •  AWARD   Best Paper Award at the IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) 2019
      Date: December 18, 2019
      Awarded to: Xuankai Chang, Wangyou Zhang, Yanmin Qian, Jonathan Le Roux, Shinji Watanabe
      MERL Contact: Jonathan Le Roux
      Research Areas: Artificial Intelligence, Machine Learning, Speech & Audio
      Brief
      • MERL researcher Jonathan Le Roux and co-authors Xuankai Chang, Shinji Watanabe (Johns Hopkins University), Wangyou Zhang, and Yanmin Qian (Shanghai Jiao Tong University) won the Best Paper Award at the 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2019) for the paper "MIMO-Speech: End-to-End Multi-Channel Multi-Speaker Speech Recognition". MIMO-Speech is a fully neural end-to-end framework that can transcribe the speech of multiple speakers speaking simultaneously from multi-channel input. The system comprises a monaural masking network, a multi-source neural beamformer, and a multi-output speech recognition model, which are jointly optimized solely via an automatic speech recognition (ASR) criterion. The award was received by lead author Xuankai Chang during the workshop, which was held in Sentosa, Singapore, December 14-18, 2019. A schematic sketch of the pipeline appears below.
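
        Schematically (a sketch under assumed interfaces, not the released code; mask_net, beamformer, and asr_model are placeholder differentiable modules), the three stages can be chained so that the gradient of the ASR loss flows through the beamformer into the masking network:

          import torch.nn as nn

          class MIMOSpeechSketch(nn.Module):
              """Hypothetical outline of the MIMO-Speech structure."""
              def __init__(self, mask_net, beamformer, asr_model, n_speakers=2):
                  super().__init__()
                  self.mask_net = mask_net      # monaural masking network
                  self.beamformer = beamformer  # differentiable MVDR-style beamformer
                  self.asr = asr_model          # shared single-speaker ASR model
                  self.n_speakers = n_speakers

              def forward(self, mix_stft):  # (batch, channels, time, freq) mixture STFT
                  masks = self.mask_net(mix_stft)  # one time-frequency mask per speaker
                  hyps = []
                  for spk in range(self.n_speakers):
                      enhanced = self.beamformer(mix_stft, masks[spk])  # one clean stream
                      hyps.append(self.asr(enhanced))  # token posteriors for this speaker
                  # Training applies only an ASR criterion (e.g., joint CTC/attention)
                  # to hyps, resolving the speaker permutation against the references.
                  return hyps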
    •  AWARD   Best Student Paper Award at IEEE ICASSP 2018
      Date: April 17, 2018
      Awarded to: Zhong-Qiu Wang
      MERL Contact: Jonathan Le Roux
      Research Area: Speech & Audio
      Brief
      • Former MERL intern Zhong-Qiu Wang (Ph.D. candidate at Ohio State University) received a Best Student Paper Award at the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2018) for the paper "Multi-Channel Deep Clustering: Discriminative Spectral and Spatial Embeddings for Speaker-Independent Speech Separation" by Zhong-Qiu Wang, Jonathan Le Roux, and John Hershey. The paper presents work performed during Zhong-Qiu's internship at MERL in the summer of 2017, extending MERL's pioneering Deep Clustering framework for speech separation to a multi-channel setup. The award was received on behalf of Zhong-Qiu by MERL researcher and co-author Jonathan Le Roux during the conference, held in Calgary, April 15-20. A sketch of the underlying objective follows.
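
        For reference (a sketch, not the paper's code), the underlying deep clustering objective ||V V^T - Y Y^T||_F^2 can be computed without ever forming the huge TF x TF affinity matrices; the multi-channel extension primarily changes the input features, e.g., by appending spatial features such as inter-channel phase differences to the spectral features:

          import torch

          def deep_clustering_loss(V, Y):
              """V: (batch, TF, D) unit-norm embeddings, one per time-frequency bin.
              Y: (batch, TF, S) one-hot ideal assignments of bins to speakers.
              Returns ||V V^T - Y Y^T||_F^2, expanded into small Gram matrices:
              ||V^T V||_F^2 - 2 ||V^T Y||_F^2 + ||Y^T Y||_F^2."""
              def gram_sq(A, B):  # ||A^T B||_F^2 for each batch element
                  return torch.einsum("bnd,bne->bde", A, B).pow(2).sum(dim=(1, 2))
              return (gram_sq(V, V) - 2 * gram_sq(V, Y) + gram_sq(Y, Y)).mean()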

    See All Awards for Speech & Audio
  • News & Events

    •  NEWS   MERL Congratulates Recipients of 2022 IEEE Technical Field Awards in Signal Processing
      Date: July 26, 2021
      MERL Contacts: Petros Boufounos; Jonathan Le Roux; Philip Orlik; Anthony Vetro
      Research Areas: Signal Processing, Speech & Audio
      Brief
      • IEEE has announced that the recipients of the 2022 IEEE James L. Flanagan Speech and Audio Processing Award will be Hervé Bourlard (EPFL/Idiap Research Institute) and Nelson Morgan (ICSI), "For contributions to neural networks for statistical speech recognition," and that the recipient of the 2022 IEEE Fourier Award for Signal Processing will be Ali Sayed (EPFL), "For contributions to the theory and practice of adaptive signal processing." More details about the contributions of Prof. Bourlard and Prof. Morgan can be found in the announcements by ICSI and EPFL, and those of Prof. Sayed in EPFL's announcement. Mitsubishi Electric Research Laboratories (MERL) has recently become the new sponsor of these two prestigious awards and extends its warmest congratulations to all of the 2022 award recipients.

        The IEEE Board of Directors established the IEEE James L. Flanagan Speech and Audio Processing Award in 2002 for outstanding contributions to the advancement of speech and/or audio signal processing, while the IEEE Fourier Award for Signal Processing was established in 2012 for outstanding contribution to the advancement of signal processing, other than in the areas of speech and audio processing. Both awards have recognized the contributions of some of the most renowned pioneers and leaders in their respective fields. MERL is proud to support the recognition of outstanding contributions to the signal processing field through its sponsorship of these awards.
    •  NEWS   MERL becomes new sponsor of two prestigious IEEE Technical Field Awards in Signal Processing
      Date: July 9, 2021
      MERL Contacts: Petros Boufounos; Jonathan Le Roux; Philip Orlik; Anthony Vetro
      Research Areas: Signal Processing, Speech & Audio
      Brief
      • Mitsubishi Electric Research Laboratories (MERL) has become the new sponsor of two prestigious IEEE Technical Field Awards in Signal Processing, the IEEE James L. Flanagan Speech and Audio Processing Award and the IEEE Fourier Award for Signal Processing, for the years 2022-2031. "MERL is proud to support the recognition of outstanding contributions to signal processing by sponsoring both the IEEE James L. Flanagan Speech and Audio Processing Award and the IEEE Fourier Award for Signal Processing. These awards celebrate the creativity and innovation in the field that touch many aspects of our lives and drive our society forward," said Dr. Anthony Vetro, VP and Director at MERL.

        The IEEE Board of Directors established the IEEE James L. Flanagan Speech and Audio Processing Award in 2002 for outstanding contributions to the advancement of speech and/or audio signal processing, while the IEEE Fourier Award for Signal Processing was established in 2012 for outstanding contribution to the advancement of signal processing, other than in the areas of speech and audio processing. Both awards have since recognized the contributions of some of the most renowned pioneers and leaders in their respective fields.

        By underwriting these IEEE Technical Field Awards, MERL continues to make a mark by supporting the advancement of technology that makes lasting changes in the world.

    See All News & Events for Speech & Audio
  • Research Highlights

  • Internships

    • SA1612: End-to-end speech and audio processing

      MERL is looking for interns to work on fundamental research in end-to-end speech and audio processing for new and challenging environments using advanced machine learning techniques. The intern will collaborate with MERL researchers to derive and implement new models and learning methods, conduct experiments, and prepare results for high-impact publication. Ideal candidates are senior Ph.D. students with experience in one or more of automatic speech recognition, speech enhancement, sound event detection, and natural language processing, including solid theoretical and practical knowledge of the relevant machine learning algorithms and the related programming skills. The internship will take place during fall/winter 2021, with an expected duration of 3-6 months and a flexible start date. The internship is preferably onsite at MERL, but may be performed remotely if the COVID pandemic makes that necessary.


    See All Internships for Speech & Audio
  • Openings


    See All Openings at MERL
  • Recent Publications

    •  Chatterjee, M., Le Roux, J., Ahuja, N., Cherian, A., "Visual Scene Graphs for Audio Source Separation", IEEE International Conference on Computer Vision (ICCV), October 2021.
      BibTeX TR2021-095 PDF
      • @inproceedings{Chatterjee2021oct,
      • author = {Chatterjee, Moitreya and Le Roux, Jonathan and Ahuja, Narendra and Cherian, Anoop},
      • title = {Visual Scene Graphs for Audio Source Separation},
      • booktitle = {IEEE International Conference on Computer Vision (ICCV)},
      • year = 2021,
      • month = oct,
      • url = {https://www.merl.com/publications/TR2021-095}
      • }
    •  Higuchi, Y., Moritz, N., Le Roux, J., Hori, T., "Momentum Pseudo-Labeling for Semi-Supervised Speech Recognition", Annual Conference of the International Speech Communication Association (Interspeech), DOI: 10.21437/Interspeech.2021-571, September 2021, pp. 726-730.
      BibTeX TR2021-103 PDF
      • @inproceedings{Higuchi2021sep,
      • author = {Higuchi, Yosuke and Moritz, Niko and Le Roux, Jonathan and Hori, Takaaki},
      • title = {Momentum Pseudo-Labeling for Semi-Supervised Speech Recognition},
      • booktitle = {Annual Conference of the International Speech Communication Association (Interspeech)},
      • year = 2021,
      • pages = {726--730},
      • month = sep,
      • doi = {10.21437/Interspeech.2021-571},
      • url = {https://www.merl.com/publications/TR2021-103}
      • }
    •  Hori, T., Moritz, N., Hori, C., Le Roux, J., "Advanced Long-context End-to-end Speech Recognition Using Context-expanded Transformers", Annual Conference of the International Speech Communication Association (Interspeech), DOI: 10.21437/Interspeech.2021-1643, August 2021, pp. 2097-2101.
      BibTeX TR2021-100 PDF
      • @inproceedings{Hori2021aug3,
      • author = {Hori, Takaaki and Moritz, Niko and Hori, Chiori and Le Roux, Jonathan},
      • title = {Advanced Long-context End-to-end Speech Recognition Using Context-expanded Transformers},
      • booktitle = {Annual Conference of the International Speech Communication Association (Interspeech)},
      • year = 2021,
      • pages = {2097--2101},
      • month = aug,
      • doi = {10.21437/Interspeech.2021-1643},
      • url = {https://www.merl.com/publications/TR2021-100}
      • }
    •  Hori, C., Hori, T., Le Roux, J., "Optimizing Latency for Online Video Captioning Using Audio-Visual Transformers", Annual Conference of the International Speech Communication Association (Interspeech), DOI: 10.21437/Interspeech.2021-1975, August 2021, pp. 586-590.
      BibTeX TR2021-093 PDF
      • @inproceedings{Hori2021aug2,
      • author = {Hori, Chiori and Hori, Takaaki and Le Roux, Jonathan},
      • title = {Optimizing Latency for Online Video Captioning Using Audio-Visual Transformers},
      • booktitle = {Annual Conference of the International Speech Communication Association (Interspeech)},
      • year = 2021,
      • pages = {586--590},
      • month = aug,
      • publisher = {ISCA},
      • doi = {10.21437/Interspeech.2021-1975},
      • url = {https://www.merl.com/publications/TR2021-093}
      • }
    •  Moritz, N., Hori, T., Le Roux, J., "Dual Causal/Non-Causal Self-Attention for Streaming End-to-End Speech Recognition", Annual Conference of the International Speech Communication Association (Interspeech), DOI: 10.21437/​Interspeech.2021-1693, August 2021, pp. 1822-1826.
      BibTeX TR2021-094 PDF
      • @inproceedings{Moritz2021aug,
      • author = {Moritz, Niko and Hori, Takaaki and Le Roux, Jonathan},
      • title = {Dual Causal/Non-Causal Self-Attention for Streaming End-to-End Speech Recognition},
      • booktitle = {Annual Conference of the International Speech Communication Association (Interspeech)},
      • year = 2021,
      • pages = {1822--1826},
      • month = aug,
      • doi = {10.21437/Interspeech.2021-1693},
      • url = {https://www.merl.com/publications/TR2021-094}
      • }
    •  Moritz, N., Hori, T., Le Roux, J., "Capturing Multi-Resolution Context by Dilated Self-Attention", IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), DOI: 10.1109/​ICASSP39728.2021.9415001, June 2021, pp. 5869-5873.
      BibTeX TR2021-036 PDF
      • @inproceedings{Moritz2021jun,
      • author = {Moritz, Niko and Hori, Takaaki and Le Roux, Jonathan},
      • title = {Capturing Multi-Resolution Context by Dilated Self-Attention},
      • booktitle = {IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)},
      • year = 2021,
      • pages = {5869--5873},
      • month = jun,
      • doi = {10.1109/ICASSP39728.2021.9415001},
      • url = {https://www.merl.com/publications/TR2021-036}
      • }
    •  Hung, Y.-N., Wichern, G., Le Roux, J., "Transcription Is All You Need: Learning to Separate Musical Mixtures with Score as Supervision", IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), DOI: 10.1109/ICASSP39728.2021.9413358, June 2021, pp. 46-50.
      BibTeX TR2021-069 PDF
      • @inproceedings{Hung2021jun,
      • author = {Hung, Yun-Ning and Wichern, Gordon and Le Roux, Jonathan},
      • title = {Transcription Is All You Need: Learning to Separate Musical Mixtures with Score as Supervision},
      • booktitle = {IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)},
      • year = 2021,
      • pages = {46--50},
      • month = jun,
      • doi = {10.1109/ICASSP39728.2021.9413358},
      • issn = {2379-190X},
      • isbn = {978-1-7281-7605-5},
      • url = {https://www.merl.com/publications/TR2021-069}
      • }
    •  Khurana, S., Moritz, N., Hori, T., Le Roux, J., "Unsupervised Domain Adaptation For Speech Recognition via Uncertainty Driven Self-Training", IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), DOI: 10.1109/ICASSP39728.2021.9414299, June 2021, pp. 6553-6557.
      BibTeX TR2021-039 PDF
      • @inproceedings{Khurana2021jun,
      • author = {Khurana, Sameer and Moritz, Niko and Hori, Takaaki and Le Roux, Jonathan},
      • title = {Unsupervised Domain Adaptation For Speech Recognition via Uncertainty Driven Self-Training},
      • booktitle = {IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)},
      • year = 2021,
      • pages = {6553--6557},
      • month = jun,
      • doi = {10.1109/ICASSP39728.2021.9414299},
      • url = {https://www.merl.com/publications/TR2021-039}
      • }
    See All Publications for Speech & Audio
  • Videos

  • Software Downloads