Speech & Audio
Audio source separation, recognition, and understanding.
Our current research focuses on the application of machine learning to estimation and inference problems in speech and audio processing. Topics include end-to-end speech recognition and enhancement, acoustic modeling and analysis, statistical dialog systems, natural language understanding, and adaptive multimodal interfaces.
Awards
AWARD Best Poster Award and Best Video Award at the International Society for Music Information Retrieval Conference (ISMIR) 2020 Date: October 15, 2020
Awarded to: Ethan Manilow, Gordon Wichern, Jonathan Le Roux
MERL Contacts: Jonathan Le Roux; Gordon Wichern
Research Areas: Artificial Intelligence, Machine Learning, Speech & Audio
Brief: Former MERL intern Ethan Manilow and MERL researchers Gordon Wichern and Jonathan Le Roux won the Best Poster Award and Best Video Award at the 2020 International Society for Music Information Retrieval Conference (ISMIR 2020) for the paper "Hierarchical Musical Source Separation". The conference was held October 11-14 in a virtual format. Both awards were decided by popular vote among the conference attendees.
The paper proposes a new method for isolating individual sounds in an audio mixture that accounts for the hierarchical relationship between sound sources. Many sounds we are interested in analyzing are hierarchical in nature, e.g., during a music performance, a hi-hat note is one of many such hi-hat notes, which is one of several parts of a drumkit, itself one of many instruments in a band, which might be playing in a bar with other sounds occurring. Inspired by this, the paper re-frames the audio source separation problem as hierarchical, combining similar sounds together at certain levels while separating them at other levels, and shows on a musical instrument separation task that a hierarchical approach outperforms non-hierarchical models while also requiring less training data. The paper, poster, and video can be seen on the paper page on the ISMIR website.
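The hierarchical framing can be illustrated with a small sketch: if a system produces soft separation masks for leaf-level sources, masks for coarser levels must stay consistent with them. The hierarchy and helper below are hypothetical illustrations, not the paper's actual implementation.

```python
import numpy as np

# Hypothetical two-level instrument hierarchy (for illustration only):
# each coarse source is the union of its fine-grained children.
HIERARCHY = {
    "drums": ["hi-hat", "snare"],
    "guitars": ["lead guitar", "rhythm guitar"],
}

def parent_masks(leaf_masks, hierarchy):
    """Derive a soft (T, F) mask for each coarse source as the element-wise
    maximum over its children's masks, so that separating at the coarse
    level remains consistent with the fine level."""
    return {
        parent: np.maximum.reduce([leaf_masks[child] for child in children])
        for parent, children in hierarchy.items()
    }
```

Because the parent mask is an element-wise maximum, any time-frequency bin assigned to a child is also covered by its parent, which is the nesting property the hierarchical formulation relies on.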
AWARD Best Paper Award at the IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) 2019 Date: December 18, 2019
Awarded to: Xuankai Chang, Wangyou Zhang, Yanmin Qian, Jonathan Le Roux, Shinji Watanabe
MERL Contact: Jonathan Le Roux
Research Areas: Artificial Intelligence, Machine Learning, Speech & Audio
Brief: MERL researcher Jonathan Le Roux and co-authors Xuankai Chang, Shinji Watanabe (Johns Hopkins University), Wangyou Zhang, and Yanmin Qian (Shanghai Jiao Tong University) won the Best Paper Award at the 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2019) for the paper "MIMO-Speech: End-to-End Multi-Channel Multi-Speaker Speech Recognition". MIMO-Speech is a fully neural end-to-end framework that can transcribe the speech of multiple simultaneous speakers from multi-channel input. The system comprises a monaural masking network, a multi-source neural beamformer, and a multi-output speech recognition model, which are jointly optimized solely via an automatic speech recognition (ASR) criterion. The award was received by lead author Xuankai Chang during the conference, which was held in Sentosa, Singapore, December 14-18, 2019.
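The mask-then-beamform stage of such systems can be sketched with a generic mask-based MVDR beamformer, a standard recipe in neural multi-channel enhancement. This is an illustrative simplification in NumPy, not MERL's MIMO-Speech implementation; the masks would normally come from a trained masking network.

```python
import numpy as np

def mvdr_weights(spec, speech_mask, noise_mask, ref=0):
    """Compute per-frequency MVDR beamforming weights.

    spec: (C, T, F) multi-channel complex STFT.
    speech_mask, noise_mask: (T, F) soft masks (e.g. from a neural network).
    ref: index of the reference microphone.
    Returns w: (F, C) complex weights; apply as w[f].conj() @ spec[:, t, f].
    """
    C, T, F = spec.shape
    w = np.zeros((F, C), dtype=complex)
    for f in range(F):
        X = spec[:, :, f]  # (C, T) frames at this frequency
        # Mask-weighted spatial covariance matrices of speech and noise.
        phi_s = (speech_mask[:, f] * X) @ X.conj().T / max(speech_mask[:, f].sum(), 1e-8)
        phi_n = (noise_mask[:, f] * X) @ X.conj().T / max(noise_mask[:, f].sum(), 1e-8)
        phi_n += 1e-6 * np.eye(C)  # diagonal loading for numerical stability
        # MVDR solution (reference-channel formulation).
        num = np.linalg.solve(phi_n, phi_s)
        w[f] = num[:, ref] / np.trace(num)
    return w
```

In a multi-speaker setup like MIMO-Speech, one such beamformer would be instantiated per estimated source, and the whole pipeline trained end-to-end through the ASR loss rather than with a separate enhancement objective.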
AWARD Best Student Paper Award at IEEE ICASSP 2018 Date: April 17, 2018
Awarded to: Zhong-Qiu Wang
MERL Contact: Jonathan Le Roux
Research Area: Speech & Audio
Brief: Former MERL intern Zhong-Qiu Wang (Ph.D. candidate at Ohio State University) received a Best Student Paper Award at the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2018) for the paper "Multi-Channel Deep Clustering: Discriminative Spectral and Spatial Embeddings for Speaker-Independent Speech Separation" by Zhong-Qiu Wang, Jonathan Le Roux, and John Hershey. The paper presents work performed during Zhong-Qiu's internship at MERL in the summer of 2017, extending MERL's pioneering deep clustering framework for speech separation to a multi-channel setup. The award was received on behalf of Zhong-Qiu by MERL researcher and co-author Jonathan Le Roux during the conference, held in Calgary April 15-20.
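Deep clustering separates speakers by mapping each time-frequency bin to an embedding vector and clustering the embeddings, with each cluster yielding one speaker's mask. The sketch below shows only the inference step, assuming embeddings have already been produced by a trained network; the tiny k-means stands in for any clustering routine and is not the paper's exact procedure.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Minimal k-means over the rows of X (n_points, dim); returns labels."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            pts = X[labels == j]
            if len(pts):
                centers[j] = pts.mean(0)
    return labels

def masks_from_embeddings(emb, n_speakers):
    """emb: (T, F, D) unit-norm embeddings, one per time-frequency bin,
    as produced by a deep clustering network. Cluster the bins and return
    one binary (T, F) mask per speaker."""
    T, F, D = emb.shape
    labels = kmeans(emb.reshape(-1, D), n_speakers).reshape(T, F)
    return [(labels == s).astype(float) for s in range(n_speakers)]
```

The multi-channel extension in the awarded paper augments the spectral input with spatial features so the learned embeddings also capture inter-microphone cues; the clustering step itself is unchanged.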
See All Awards for Speech & Audio

News & Events
NEWS Jonathan Le Roux gives invited talk at CMU's Language Technologies Institute Colloquium Date: December 9, 2022
Where: Pittsburgh, PA
MERL Contact: Jonathan Le Roux
Research Areas: Artificial Intelligence, Machine Learning, Speech & Audio
Brief: Jonathan Le Roux, MERL Senior Principal Research Scientist and Speech and Audio Senior Team Leader, was invited by Carnegie Mellon University's Language Technologies Institute (LTI) to speak in the LTI Colloquium Series, a prestigious series of talks given by experts from across the country on different areas of language technologies. His talk, entitled "Towards general and flexible audio source separation", presented an overview of techniques developed at MERL for robustly and flexibly decomposing and analyzing an acoustic scene, describing in particular the Speech and Audio Team's efforts to extend MERL's early speech separation and enhancement methods to more challenging environments and to more general and less supervised scenarios.
EVENT MERL's Virtual Open House 2022 Date & Time: Monday, December 12, 2022; 1:00pm-5:30pm ET
Location: Mitsubishi Electric Research Laboratories (MERL)/Virtual
Research Areas: Applied Physics, Artificial Intelligence, Communications, Computational Sensing, Computer Vision, Control, Data Analytics, Dynamical Systems, Electric Systems, Electronic and Photonic Devices, Machine Learning, Multi-Physical Modeling, Optimization, Robotics, Signal Processing, Speech & Audio, Digital Video
Brief: Join MERL's virtual open house on December 12th, 2022! Featuring a keynote, live sessions, research area booths, and opportunities to interact with our research team. Discover who we are and what we do, and learn about internship and employment opportunities.
See All News & Events for Speech & Audio
Research Highlights
Openings
See All Openings at MERL
Recent Publications
- "STFT-Domain Neural Speech Enhancement with Very Low Algorithmic Latency", IEEE/ACM Transactions on Audio, Speech, and Language Processing, December 2022. (TR2022-166, PDF)
    @article{Wang2022dec2,
      author = {Wang, Zhong-Qiu and Wichern, Gordon and Watanabe, Shinji and Le Roux, Jonathan},
      title = {STFT-Domain Neural Speech Enhancement with Very Low Algorithmic Latency},
      journal = {IEEE/ACM Transactions on Audio, Speech, and Language Processing},
      year = 2022,
      month = dec,
      url = {https://www.merl.com/publications/TR2022-166}
    }
- "Improved Domain Generalization via Disentangled Multi-Task Learning in Unsupervised Anomalous Sound Detection", DCASE Workshop, Lagrange, M. and Mesaros, A. and Pellegrini, T. and Richard, G. and Serizel, R. and Stowell, D., Eds., November 2022. (TR2022-146, PDF, Presentation)
    @inproceedings{Venkatesh2022nov,
      author = {Venkatesh, Satvik and Wichern, Gordon and Subramanian, Aswin Shanmugam and Le Roux, Jonathan},
      title = {Improved Domain Generalization via Disentangled Multi-Task Learning in Unsupervised Anomalous Sound Detection},
      booktitle = {DCASE Workshop},
      year = 2022,
      editor = {Lagrange, M. and Mesaros, A. and Pellegrini, T. and Richard, G. and Serizel, R. and Stowell, D.},
      month = nov,
      isbn = {978-952-03-2677-7},
      url = {https://www.merl.com/publications/TR2022-146}
    }
- "Learning Audio-Visual Dynamics Using Scene Graphs for Audio Source Separation", Advances in Neural Information Processing Systems (NeurIPS), November 2022. (TR2022-140, PDF)
    @inproceedings{Chatterjee2022nov,
      author = {Chatterjee, Moitreya and Ahuja, Narendra and Cherian, Anoop},
      title = {Learning Audio-Visual Dynamics Using Scene Graphs for Audio Source Separation},
      booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
      year = 2022,
      month = nov,
      url = {https://www.merl.com/publications/TR2022-140}
    }
- "AVLEN: Audio-Visual-Language Embodied Navigation in 3D Environments", Advances in Neural Information Processing Systems (NeurIPS), October 2022. (TR2022-131, PDF, Video)
    @inproceedings{Paul2022oct2,
      author = {Paul, Sudipta and Roy Chowdhury, Amit K and Cherian, Anoop},
      title = {AVLEN: Audio-Visual-Language Embodied Navigation in 3D Environments},
      booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
      year = 2022,
      month = oct,
      url = {https://www.merl.com/publications/TR2022-131}
    }
- "Low-Latency Streaming Scene-aware Interaction Using Audio-Visual Transformers", Interspeech, DOI: 10.21437/Interspeech.2022-10891, September 2022, pp. 4511-4515. (TR2022-116, PDF)
    @inproceedings{Hori2022sep,
      author = {Hori, Chiori and Hori, Takaaki and Le Roux, Jonathan},
      title = {Low-Latency Streaming Scene-aware Interaction Using Audio-Visual Transformers},
      booktitle = {Interspeech},
      year = 2022,
      pages = {4511--4515},
      month = sep,
      doi = {10.21437/Interspeech.2022-10891},
      url = {https://www.merl.com/publications/TR2022-116}
    }
- "Heterogeneous Target Speech Separation", Interspeech, DOI: 10.21437/Interspeech.2022-10717, September 2022, pp. 1796-1800. (TR2022-115, PDF, Video, Presentation)
    @inproceedings{Tzinis2022sep,
      author = {Tzinis, Efthymios and Wichern, Gordon and Subramanian, Aswin Shanmugam and Smaragdis, Paris and Le Roux, Jonathan},
      title = {Heterogeneous Target Speech Separation},
      booktitle = {Interspeech},
      year = 2022,
      pages = {1796--1800},
      month = sep,
      doi = {10.21437/Interspeech.2022-10717},
      url = {https://www.merl.com/publications/TR2022-115}
    }
- "Momentum Pseudo-Labeling: Semi-Supervised ASR with Continuously Improving Pseudo-Labels", IEEE Journal of Selected Topics in Signal Processing, DOI: 10.1109/JSTSP.2022.3195367, Vol. 16, No. 6, pp. 1424-1438, September 2022. (TR2022-112, PDF)
    @article{Higuchi2022sep,
      author = {Higuchi, Yosuke and Moritz, Niko and Le Roux, Jonathan and Hori, Takaaki},
      title = {Momentum Pseudo-Labeling: Semi-Supervised ASR with Continuously Improving Pseudo-Labels},
      journal = {IEEE Journal of Selected Topics in Signal Processing},
      year = 2022,
      volume = 16,
      number = 6,
      pages = {1424--1438},
      month = sep,
      doi = {10.1109/JSTSP.2022.3195367},
      issn = {1941-0484},
      url = {https://www.merl.com/publications/TR2022-112}
    }
- "Disentangled Surrogate Task Learning for Improved Domain Generalization in Unsupervised Anomalous Sound Detection," Tech. Rep. TR2022-092, Detection and Classification of Acoustic Scenes and Events (DCASE) Challenge 2022, July 2022. (TR2022-092, PDF, Presentation)
    @techreport{Venkatesh2022jul,
      author = {Venkatesh, Satvik and Wichern, Gordon and Subramanian, Aswin Shanmugam and Le Roux, Jonathan},
      title = {Disentangled Surrogate Task Learning for Improved Domain Generalization in Unsupervised Anomalous Sound Detection},
      institution = {DCASE2022 Challenge},
      year = 2022,
      month = jul,
      url = {https://www.merl.com/publications/TR2022-092}
    }
Videos
- Human Perspective Scene Understanding via Multimodal Sensing
- [MERL Seminar Series Spring 2022] Learning Speech Representations with Multimodal Self-Supervision
- The Cocktail Fork Problem: Three-Stem Audio Separation for Real-World Soundtracks
- [MERL Seminar Series 2021] Look and Listen: From Semantic to Spatial Audio-Visual Perception
- Scene-Aware Interaction Technology
- Seamless Speech Recognition
- Speech Enhancement and Separation

Software Downloads