Speech & Audio

Audio source separation, recognition, and understanding.

Our current research focuses on the application of machine learning to estimation and inference problems in speech and audio processing. Topics include end-to-end speech recognition and enhancement, acoustic modeling and analysis, statistical dialog systems, natural language understanding, and adaptive multimodal interfaces.

  • Researchers

  • Awards

    •  AWARD    Best Poster Award and Best Video Award at the International Society for Music Information Retrieval Conference (ISMIR) 2020
      Date: October 15, 2020
      Awarded to: Ethan Manilow, Gordon Wichern, Jonathan Le Roux
      MERL Contacts: Jonathan Le Roux; Gordon Wichern
      Research Areas: Artificial Intelligence, Machine Learning, Speech & Audio
      Brief
      • Former MERL intern Ethan Manilow and MERL researchers Gordon Wichern and Jonathan Le Roux won the Best Poster Award and the Best Video Award at the 2020 International Society for Music Information Retrieval Conference (ISMIR 2020) for the paper "Hierarchical Musical Source Separation". The conference was held October 11-14 in a virtual format. Both awards were determined by popular vote among the conference attendees.

        The paper proposes a new method for isolating individual sounds in an audio mixture that accounts for the hierarchical relationship between sound sources. Many sounds we are interested in analyzing are hierarchical in nature, e.g., during a music performance, a hi-hat note is one of many such hi-hat notes, which is one of several parts of a drumkit, itself one of many instruments in a band, which might be playing in a bar with other sounds occurring. Inspired by this, the paper re-frames the audio source separation problem as hierarchical, combining similar sounds together at certain levels while separating them at other levels, and shows on a musical instrument separation task that a hierarchical approach outperforms non-hierarchical models while also requiring less training data. The paper, poster, and video can be seen on the paper page on the ISMIR website.
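        The hierarchical idea can be illustrated with a small mask-based sketch (a simplification of ours, not the authors' architecture; the shapes, the two-level grouping, and the L1 training loss are illustrative assumptions): fine-level masks estimate individual sources, coarse-level estimates are obtained by summing fine-level estimates within each group, and both levels are supervised.

        import torch
        import torch.nn as nn

        class HierarchicalMaskNet(nn.Module):
            """Toy two-level separator: fine sources roll up into coarse groups."""
            def __init__(self, n_freq=513, n_fine=3, hidden=256):
                super().__init__()
                self.rnn = nn.GRU(n_freq, hidden, batch_first=True, bidirectional=True)
                self.fine_masks = nn.Linear(2 * hidden, n_freq * n_fine)  # one mask per fine source
                self.n_fine, self.n_freq = n_fine, n_freq

            def forward(self, mix_mag):  # mix_mag: (batch, time, freq) mixture magnitudes
                h, _ = self.rnn(mix_mag)
                m = torch.sigmoid(self.fine_masks(h))
                m = m.view(*mix_mag.shape[:2], self.n_fine, self.n_freq)
                fine = m * mix_mag.unsqueeze(2)  # fine-level source estimates
                # coarse level: sum fine estimates within each (hypothetical) group,
                # e.g., sources {0, 1} form one group and source {2} another
                coarse = torch.stack([fine[:, :, :2].sum(2), fine[:, :, 2]], dim=2)
                return fine, coarse

        def hierarchical_loss(fine_est, coarse_est, fine_ref, coarse_ref):
            # supervise both hierarchy levels so estimates stay mutually consistent
            return (nn.functional.l1_loss(fine_est, fine_ref)
                    + nn.functional.l1_loss(coarse_est, coarse_ref))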
    •  AWARD    Best Paper Award at the IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) 2019
      Date: December 18, 2019
      Awarded to: Xuankai Chang, Wangyou Zhang, Yanmin Qian, Jonathan Le Roux, Shinji Watanabe
      MERL Contact: Jonathan Le Roux
      Research Areas: Artificial Intelligence, Machine Learning, Speech & Audio
      Brief
      • MERL researcher Jonathan Le Roux and co-authors Xuankai Chang, Shinji Watanabe (Johns Hopkins University), Wangyou Zhang, and Yanmin Qian (Shanghai Jiao Tong University) won the Best Paper Award at the 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2019) for the paper "MIMO-Speech: End-to-End Multi-Channel Multi-Speaker Speech Recognition". MIMO-Speech is a fully neural end-to-end framework that can transcribe the text of multiple speakers speaking simultaneously from multi-channel input. The system consists of a monaural masking network, a multi-source neural beamformer, and a multi-output speech recognition model, which are jointly optimized solely via an automatic speech recognition (ASR) criterion. The award was received by lead author Xuankai Chang during the conference, which was held in Sentosa, Singapore, December 14-18, 2019.
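        As a rough illustration of how such a pipeline can be trained end-to-end (a schematic sketch of ours, not the MIMO-Speech implementation; the module internals, shapes, and mask-weighted "beamformer" are simplifying assumptions), the three components can be chained so that the only supervision is the ASR loss at each recognized speaker's output.

        import torch
        import torch.nn as nn

        class MaskNet(nn.Module):  # monaural masking network: per-speaker T-F masks
            def __init__(self, n_freq=257, n_spk=2, hidden=256):
                super().__init__()
                self.rnn = nn.GRU(n_freq, hidden, batch_first=True, bidirectional=True)
                self.out = nn.Linear(2 * hidden, n_freq * n_spk)
                self.n_spk, self.n_freq = n_spk, n_freq

            def forward(self, mag):  # mag: (batch, chan, time, freq)
                b, c, t, f = mag.shape
                h, _ = self.rnn(mag.reshape(b * c, t, f))
                return torch.sigmoid(self.out(h)).view(b, c, t, self.n_spk, self.n_freq)

        class SimpleBeamformer(nn.Module):
            # stand-in for the multi-source neural beamformer: mask-weighted channel
            # average (the actual system derives spatial filters from the masks)
            def forward(self, mag, masks):
                return (masks * mag.unsqueeze(3)).mean(dim=1)  # (batch, time, n_spk, freq)

        class TinyASR(nn.Module):  # one branch of the multi-output recognizer
            def __init__(self, n_freq=257, n_tokens=30, hidden=256):
                super().__init__()
                self.rnn = nn.GRU(n_freq, hidden, batch_first=True)
                self.out = nn.Linear(hidden, n_tokens)

            def forward(self, feats):  # feats: (batch, time, freq) for one speaker
                h, _ = self.rnn(feats)
                return self.out(h).log_softmax(-1)  # CTC-style log-probabilities

        # Joint training uses only an ASR criterion, e.g. for each speaker s:
        #   loss += ctc_loss(asr(beamformed[:, :, s]), transcript[s])
        # so gradients flow through the recognizer, beamformer, and mask network.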
    •  AWARD    Best Student Paper Award at IEEE ICASSP 2018
      Date: April 17, 2018
      Awarded to: Zhong-Qiu Wang
      MERL Contact: Jonathan Le Roux
      Research Area: Speech & Audio
      Brief
      • Former MERL intern Zhong-Qiu Wang (Ph.D. candidate at Ohio State University) has received a Best Student Paper Award at the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2018) for the paper "Multi-Channel Deep Clustering: Discriminative Spectral and Spatial Embeddings for Speaker-Independent Speech Separation" by Zhong-Qiu Wang, Jonathan Le Roux, and John Hershey. The paper presents work performed during Zhong-Qiu's internship at MERL in the summer of 2017, extending MERL's pioneering Deep Clustering framework for speech separation to a multi-channel setup. The award was received on behalf of Zhong-Qiu by MERL researcher and co-author Jonathan Le Roux during the conference, held in Calgary, April 15-20.
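        The core idea can be sketched as follows (a minimal illustration of ours, not the paper's exact network; the input features, sizes, and label format are assumptions): spatial features such as interchannel phase differences are appended to the spectral input, and the network is trained with the standard deep clustering objective so that time-frequency bins dominated by the same speaker receive nearby embeddings.

        import torch
        import torch.nn as nn

        class MultiChannelDeepClustering(nn.Module):
            def __init__(self, n_freq=129, emb_dim=20, hidden=300):
                super().__init__()
                # per-frame input: log-magnitude spectrum + interchannel phase differences
                self.rnn = nn.GRU(2 * n_freq, hidden, num_layers=2,
                                  batch_first=True, bidirectional=True)
                self.emb = nn.Linear(2 * hidden, n_freq * emb_dim)
                self.emb_dim = emb_dim

            def forward(self, log_mag, ipd):  # each: (batch, time, freq)
                x = torch.cat([log_mag, ipd], dim=-1)
                h, _ = self.rnn(x)
                v = self.emb(h).reshape(x.shape[0], -1, self.emb_dim)  # (batch, T*F, D)
                return nn.functional.normalize(v, dim=-1)  # unit-norm embeddings

        def deep_clustering_loss(v, y):
            # v: (batch, T*F, D) embeddings; y: (batch, T*F, S) dominant-speaker indicators
            # low-rank evaluation of || V V^T - Y Y^T ||_F^2
            vtv = torch.einsum('bnd,bne->bde', v, v)
            vty = torch.einsum('bnd,bns->bds', v, y)
            yty = torch.einsum('bns,bnt->bst', y, y)
            return (vtv.pow(2).sum() - 2 * vty.pow(2).sum() + yty.pow(2).sum()) / v.shape[0]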

    See All Awards for Speech & Audio
  • News & Events

    •  EVENT    MERL's Virtual Open House 2022
      Date & Time: Monday, December 12, 2022; 1:00pm-5:30pm ET
      Location: Mitsubishi Electric Research Laboratories (MERL)/Virtual
      Research Areas: Applied Physics, Artificial Intelligence, Communications, Computational Sensing, Computer Vision, Control, Data Analytics, Dynamical Systems, Electric Systems, Electronic and Photonic Devices, Machine Learning, Multi-Physical Modeling, Optimization, Robotics, Signal Processing, Speech & Audio, Digital Video
      Brief
      • Join MERL's virtual open house on December 12, 2022! The event features a keynote, live sessions, research area booths, and opportunities to interact with our research team. Discover who we are and what we do, and learn about internship and employment opportunities.
    •  NEWS    MERL researchers presenting five papers at NeurIPS 2022
      Date: November 29, 2022 - December 9, 2022
      Where: NeurIPS 2022
      MERL Contacts: Moitreya Chatterjee; Anoop Cherian; Michael J. Jones; Suhas Lohit
      Research Areas: Artificial Intelligence, Computer Vision, Machine Learning, Speech & Audio
      Brief
      • MERL researchers are presenting five papers at the NeurIPS Conference, which will be held in New Orleans from November 29 to December 1, with virtual presentations the following week. NeurIPS is one of the most prestigious and competitive international conferences in machine learning.

        MERL papers in NeurIPS 2022:

        1. “AVLEN: Audio-Visual-Language Embodied Navigation in 3D Environments” by Sudipta Paul, Amit Roy-Chowdhury, and Anoop Cherian

        This work proposes a unified multimodal task for audio-visual embodied navigation in which the navigating agent can also interact with and seek help from a human/oracle in natural language when it is uncertain of its navigation actions. We propose a multimodal deep hierarchical reinforcement learning framework for solving this challenging task that allows the agent to learn when to seek help and how to use the language instructions. AVLEN agents can interact anywhere in the 3D navigation space and demonstrate state-of-the-art performance when the audio goal is sporadic or when distractor sounds are present.
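        As a toy illustration of the hierarchical-policy structure (our own sketch; the actual AVLEN model, features, and option set differ), a high-level policy decides whether to keep navigating or to query the oracle, and a low-level policy then produces the navigation action:

        import torch
        import torch.nn as nn

        class HighLevelPolicy(nn.Module):
            # chooses an option: 0 = keep navigating toward the audio goal,
            # 1 = query the oracle and follow its language instruction
            def __init__(self, feat_dim=512, n_options=2):
                super().__init__()
                self.net = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(),
                                         nn.Linear(256, n_options))

            def forward(self, fused_obs):  # fused audio + visual + state features
                return torch.distributions.Categorical(logits=self.net(fused_obs))

        class LowLevelPolicy(nn.Module):
            # outputs navigation actions (e.g., forward, turn-left, turn-right, stop)
            def __init__(self, feat_dim=512, n_actions=4):
                super().__init__()
                self.net = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(),
                                         nn.Linear(256, n_actions))

            def forward(self, obs):
                return torch.distributions.Categorical(logits=self.net(obs))

        # One decision step; both policies would be trained with reinforcement learning.
        high, nav = HighLevelPolicy(), LowLevelPolicy()
        obs = torch.randn(1, 512)
        option = high(obs).sample()   # "when to seek help"
        action = nav(obs).sample()    # navigation action for the chosen option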

        2. “Learning Partial Equivariances From Data” by David W. Romero and Suhas Lohit

        Group equivariance serves as a good prior that improves data efficiency and generalization for deep neural networks, especially in settings with data or memory constraints. However, if the symmetry group is misspecified, equivariance can be overly restrictive and hurt performance. This paper shows how to build partial group convolutional neural networks that learn, directly from data, the level of equivariance suitable for each layer and for the task at hand. This improves performance while approximately retaining the equivariance properties.
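        A toy sketch of this idea (not the paper's construction; the C4 rotation group and the softmax weighting are illustrative choices) is a convolution that mixes responses to rotated copies of its filter with learnable per-rotation weights, so that each layer can tune how strongly it commits to the symmetry:

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class PartialRotConv(nn.Module):
            def __init__(self, in_ch, out_ch, k=3):
                super().__init__()
                self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.1)
                self.rot_logits = nn.Parameter(torch.zeros(4))  # learnable weights over C4 rotations

            def forward(self, x):
                probs = F.softmax(self.rot_logits, dim=0)
                out = 0.0
                for r in range(4):
                    w = torch.rot90(self.weight, r, dims=(2, 3))  # rotate the filter by r * 90 degrees
                    out = out + probs[r] * F.conv2d(x, w, padding=1)
                return out

        layer = PartialRotConv(3, 8)
        y = layer(torch.randn(2, 3, 32, 32))  # a uniform distribution weights all rotations equally;
                                              # a peaked one recovers an ordinary convolution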

        3. “Learning Audio-Visual Dynamics Using Scene Graphs for Audio Source Separation” by Moitreya Chatterjee, Narendra Ahuja, and Anoop Cherian

        There often exist strong correlations between the 3D motion dynamics of a sounding source and the sound it produces, especially when the source is moving towards or away from the microphone. In this paper, we propose an audio-visual scene graph that learns and leverages such correlations for improved visually guided audio separation from an audio mixture, while also enabling prediction of the sound source's direction of motion.
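        Schematically (an abstraction of ours, not the paper's model; the graph construction, network sizes, and conditioning scheme are assumptions), object nodes carrying appearance and motion features are aggregated by a small graph network, and the pooled embedding conditions a separation mask applied to the mixture spectrogram:

        import torch
        import torch.nn as nn

        class TinyGraphNet(nn.Module):
            def __init__(self, node_dim=256):
                super().__init__()
                self.msg = nn.Linear(2 * node_dim, node_dim)
                self.upd = nn.GRUCell(node_dim, node_dim)

            def forward(self, nodes):  # nodes: (num_nodes, node_dim), fully connected graph
                n = nodes.shape[0]
                pairs = torch.cat([nodes.unsqueeze(1).expand(n, n, -1),
                                   nodes.unsqueeze(0).expand(n, n, -1)], dim=-1)
                messages = torch.relu(self.msg(pairs)).mean(dim=1)  # aggregate incoming messages
                return self.upd(messages, nodes).mean(dim=0)        # pooled graph embedding

        class ConditionedMaskNet(nn.Module):
            def __init__(self, n_freq=257, cond_dim=256, hidden=256):
                super().__init__()
                self.rnn = nn.GRU(n_freq + cond_dim, hidden, batch_first=True)
                self.out = nn.Linear(hidden, n_freq)

            def forward(self, mix_mag, cond):  # mix_mag: (batch, time, freq), cond: (cond_dim,)
                c = cond.expand(mix_mag.shape[0], mix_mag.shape[1], -1)
                h, _ = self.rnn(torch.cat([mix_mag, c], dim=-1))
                return torch.sigmoid(self.out(h)) * mix_mag  # masked estimate of the target source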

        4. “What Makes a "Good" Data Augmentation in Knowledge Distillation - A Statistical Perspective” by Huan Wang, Suhas Lohit, Michael Jones, and Yun Fu

        This paper presents theoretical and practical results for understanding what makes a particular data augmentation (DA) technique suitable for knowledge distillation (KD). We design a simple metric that works very well in practice to predict the effectiveness of a DA for KD. Based on this metric, we also propose a new data augmentation technique that outperforms other methods for knowledge distillation in image recognition networks.

        5. “FeLMi : Few shot Learning with hard Mixup” by Aniket Roy, Anshul Shah, Ketul Shah, Prithviraj Dhar, Anoop Cherian, and Rama Chellappa

        Learning from only a few examples is a fundamental challenge in machine learning. Recent approaches show benefits by learning a feature extractor on the abundant, labeled base examples and transferring it to the scarcer novel examples. However, the latter stage is often prone to overfitting due to the small size of few-shot datasets. In this paper, we propose a novel uncertainty-based criterion to synthetically produce “hard” and useful data by mixing up real data samples. Our approach leads to state-of-the-art results on various computer vision few-shot benchmarks.
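        A simplified sketch of the hard-mixup selection step (illustrative only, not the FeLMi implementation; the feature-space mixing, the entropy score, and the classifier interface are assumptions): mix pairs of labeled few-shot features, score each mixed sample by classifier uncertainty, and keep only the hardest ones for training.

        import torch
        import torch.nn.functional as F

        def hard_mixup(feats, labels, classifier, n_classes, n_keep=16, alpha=2.0):
            # feats: (N, D) features of the few labeled examples; labels: (N,)
            # classifier: any callable mapping (N, D) features to (N, n_classes) logits
            idx = torch.randperm(feats.shape[0])
            lam = torch.distributions.Beta(alpha, alpha).sample((feats.shape[0], 1))
            mixed = lam * feats + (1 - lam) * feats[idx]       # mixup in feature space
            y_a = F.one_hot(labels, n_classes).float()
            y_b = F.one_hot(labels[idx], n_classes).float()
            mixed_labels = lam * y_a + (1 - lam) * y_b         # soft mixed labels
            with torch.no_grad():
                probs = classifier(mixed).softmax(dim=-1)
                entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=-1)
            hardest = entropy.topk(min(n_keep, len(entropy))).indices  # most uncertain = "hard"
            return mixed[hardest], mixed_labels[hardest]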

    See All News & Events for Speech & Audio
  • Research Highlights

  • Internships

    • CV1906: Next Generation Self-Supervised Learning

      MERL is looking for a highly motivated intern to work on an original research project in self-supervised learning. A strong background in computer vision and deep learning is required. Experience in audio-visual (multimodal) learning is a plus and will be valued. The successful candidate is expected to have published at least one paper in a top-tier computer vision or machine learning venue, such as CVPR, ECCV, ICCV, ICML, ICLR, NeurIPS, or AAAI, and to possess solid programming skills in Python and popular deep learning frameworks such as PyTorch. The candidate will collaborate with MERL researchers to develop algorithms and prepare manuscripts for scientific publication. The position is available for graduate students on a Ph.D. track. The start date is flexible, and the internship is expected to last at least 3 months. This internship is preferred to be onsite at MERL’s office in Cambridge, MA.

    • SA1874: Audio source separation and sound event detection

      We are seeking graduate students interested in helping advance the fields of source separation, speech enhancement, robust ASR, and sound event detection/localization in challenging multi-source and far-field scenarios. The interns will collaborate with MERL researchers to derive and implement new models and optimization methods, conduct experiments, and prepare results for publication. The ideal candidates are senior Ph.D. students with experience in some of the following: audio signal processing, microphone array processing, probabilistic modeling, sequence-to-sequence models, and deep learning techniques, in particular those involving minimal supervision (e.g., unsupervised, weakly supervised, self-supervised, or few-shot learning). Multiple positions are available throughout 2023 (Spring/Summer of course, but also as early as January), with expected durations of 3-6 months and flexible start dates.

    • SA1936: Multimodal scene-understanding for Robot Dialog or Indoor Monitoring

      We are looking for a graduate student interested in helping advance the field of multi-modal scene understanding, with a focus on scene understanding using natural language for robot dialog or indoor monitoring. The intern will collaborate with MERL researchers to derive and implement new models and optimization methods, conduct experiments, and prepare results for publication. The ideal candidate would be a senior Ph.D. student with experience in deep learning for audio-visual, signal, and natural language processing. The expected duration of the internship is 3-6 months, and start date is flexible.


    See All Internships for Speech & Audio
  • Openings


    See All Openings at MERL
  • Recent Publications

    •  Venkatesh, S., Wichern, G., Subramanian, A.S., Le Roux, J., "IMPROVED DOMAIN GENERALIZATION VIA DISENTANGLED MULTI-TASK LEARNING IN UNSUPERVISED ANOMALOUS SOUND DETECTION", DCASE Workshop, Lagrange, M. and Mesaros, A. and Pellegrini, T. and Richard, G. and Serizel, R. and Stowell, D., Eds., November 2022.
      BibTeX TR2022-146 PDF
      • @inproceedings{Venkatesh2022nov,
      • author = {Venkatesh, Satvik and Wichern, Gordon and Subramanian, Aswin Shanmugam and Le Roux, Jonathan},
      • title = {IMPROVED DOMAIN GENERALIZATION VIA DISENTANGLED MULTI-TASK LEARNING IN UNSUPERVISED ANOMALOUS SOUND DETECTION},
      • booktitle = {Detection and Classification of Acoustic Scenes and Events Workshop (DCASE)},
      • year = 2022,
      • editor = {Lagrange, M. and Mesaros, A. and Pellegrini, T. and Richard, G. and Serizel, R. and Stowell, D.},
      • month = nov,
      • isbn = {978-952-03-2677-7},
      • url = {https://www.merl.com/publications/TR2022-146}
      • }
    •  Chatterjee, M., Ahuja, N., Cherian, A., "Learning Audio-Visual Dynamics Using Scene Graphs for Audio Source Separation", Advances in Neural Information Processing Systems (NeurIPS), November 2022.
      BibTeX TR2022-140 PDF
      • @inproceedings{Chatterjee2022nov,
      • author = {Chatterjee, Moitreya and Ahuja, Narendra and Cherian, Anoop},
      • title = {Learning Audio-Visual Dynamics Using Scene Graphs for Audio Source Separation},
      • booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
      • year = 2022,
      • month = nov,
      • url = {https://www.merl.com/publications/TR2022-140}
      • }
    •  Paul, S., Roy Chowdhury, A.K., Cherian, A., "AVLEN: Audio-Visual-Language Embodied Navigation in 3D Environments", Advances in Neural Information Processing Systems (NeurIPS), October 2022.
      BibTeX TR2022-131 PDF Video
      • @inproceedings{Paul2022oct2,
      • author = {Paul, Sudipta and Roy Chowdhury, Amit K and Cherian, Anoop},
      • title = {AVLEN: Audio-Visual-Language Embodied Navigation in 3D Environments},
      • booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
      • year = 2022,
      • month = oct,
      • url = {https://www.merl.com/publications/TR2022-131}
      • }
    •  Hori, C., Hori, T., Le Roux, J., "Low-Latency Streaming Scene-aware Interaction Using Audio-Visual Transformers", Interspeech, DOI: 10.21437/​Interspeech.2022-10891, September 2022, pp. 4511-4515.
      BibTeX TR2022-116 PDF
      • @inproceedings{Hori2022sep,
      • author = {Hori, Chiori and Hori, Takaaki and Le Roux, Jonathan},
      • title = {Low-Latency Streaming Scene-aware Interaction Using Audio-Visual Transformers},
      • booktitle = {Interspeech},
      • year = 2022,
      • pages = {4511--4515},
      • month = sep,
      • doi = {10.21437/Interspeech.2022-10891},
      • url = {https://www.merl.com/publications/TR2022-116}
      • }
    •  Tzinis, E., Wichern, G., Subramanian, A.S., Smaragdis, P., Le Roux, J., "Heterogeneous Target Speech Separation", Interspeech, DOI: 10.21437/​Interspeech.2022-10717, September 2022, pp. 1796-1800.
      BibTeX TR2022-115 PDF
      • @inproceedings{Tzinis2022sep,
      • author = {Tzinis, Efthymios and Wichern, Gordon and Subramanian, Aswin Shanmugam and Smaragdis, Paris and Le Roux, Jonathan},
      • title = {Heterogeneous Target Speech Separation},
      • booktitle = {Interspeech},
      • year = 2022,
      • pages = {1796--1800},
      • month = sep,
      • doi = {10.21437/Interspeech.2022-10717},
      • url = {https://www.merl.com/publications/TR2022-115}
      • }
    •  Higuchi, Y., Moritz, N., Le Roux, J., Hori, T., "Momentum Pseudo-Labeling: Semi-Supervised ASR with Continuously Improving Pseudo-Labels", IEEE Journal of Selected Topics in Signal Processing, DOI: 10.1109/​JSTSP.2022.3195367, Vol. 16, No. 6, pp. 1424-1438, September 2022.
      BibTeX TR2022-112 PDF
      • @article{Higuchi2022sep,
      • author = {Higuchi, Yosuke and Moritz, Niko and Le Roux, Jonathan and Hori, Takaaki},
      • title = {Momentum Pseudo-Labeling: Semi-Supervised ASR with Continuously Improving Pseudo-Labels},
      • journal = {IEEE Journal of Selected Topics in Signal Processing},
      • year = 2022,
      • volume = 16,
      • number = 6,
      • pages = {1424--1438},
      • month = sep,
      • doi = {10.1109/JSTSP.2022.3195367},
      • issn = {1941-0484},
      • url = {https://www.merl.com/publications/TR2022-112}
      • }
    •  Venkatesh, S., Wichern, G., Subramanian, A.S., Le Roux, J., "Disentangled surrogate task learning for improved domain generalization in unsupervised anomalous sound detection," Tech. Rep. TR2022-092, Detection and Classification of Acoustic Scenes and Events (DCASE) Challenge 2022, July 2022.
      BibTeX TR2022-092 PDF Presentation
      • @techreport{Venkatesh2022jul,
      • author = {Venkatesh, Satvik and Wichern, Gordon and Subramanian, Aswin Shanmugam and Le Roux, Jonathan},
      • title = {Disentangled surrogate task learning for improved domain generalization in unsupervised anomalous sound detection},
      • institution = {DCASE2022 Challenge},
      • year = 2022,
      • month = jul,
      • url = {https://www.merl.com/publications/TR2022-092}
      • }
    •  Chatterjee, M., Ahuja, N., Cherian, A., "Quantifying Predictive Uncertainty for Stochastic Video Synthesis from Audio", IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2022.
      BibTeX TR2022-082 PDF
      • @inproceedings{Chatterjee2022jun,
      • author = {Chatterjee, Moitreya and Ahuja, Narendra and Cherian, Anoop},
      • title = {Quantifying Predictive Uncertainty for Stochastic Video Synthesis from Audio},
      • booktitle = {Sight and Sound Workshop at CVPR 2022},
      • year = 2022,
      • month = jun,
      • url = {https://www.merl.com/publications/TR2022-082}
      • }
    See All Publications for Speech & Audio
  • Videos

  • Software Downloads