Speech & Audio
Audio source separation, recognition, and understanding.
Our current research focuses on the application of machine learning to estimation and inference problems in speech and audio processing. Topics include end-to-end speech recognition and enhancement, acoustic modeling and analysis, statistical dialog systems, natural language understanding, and adaptive multimodal interfaces.
Quick Links
Researchers
Awards
AWARD Best Poster Award and Best Video Award at the International Society for Music Information Retrieval Conference (ISMIR) 2020 Date: October 15, 2020
Awarded to: Ethan Manilow, Gordon Wichern, Jonathan Le Roux
MERL Contacts: Jonathan Le Roux; Gordon Wichern
Research Areas: Artificial Intelligence, Machine Learning, Speech & Audio
Brief: Former MERL intern Ethan Manilow and MERL researchers Gordon Wichern and Jonathan Le Roux won the Best Poster Award and the Best Video Award at the 2020 International Society for Music Information Retrieval Conference (ISMIR 2020) for the paper "Hierarchical Musical Instrument Separation". The conference was held October 11-14 in a virtual format. The Best Poster and Best Video Awards were decided by popular vote among the conference attendees.
The paper proposes a new method for isolating individual sounds in an audio mixture that accounts for the hierarchical relationship between sound sources. Many sounds we are interested in analyzing are hierarchical in nature: during a music performance, for example, a hi-hat note is one of many such hi-hat notes, which is one of several parts of a drumkit, itself one of many instruments in a band, which might be playing in a bar with other sounds occurring. Inspired by this, the paper reframes the audio source separation problem as hierarchical, combining similar sounds together at certain levels while separating them at other levels, and shows on a musical instrument separation task that a hierarchical approach outperforms non-hierarchical models while also requiring less training data.
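The hierarchical constraint can be illustrated with a minimal NumPy sketch (not the paper's implementation; the mask logits, variable names, and two-level instrument tree below are hypothetical stand-ins for a trained separation network's outputs): each parent source's mask is defined as the sum of its children's masks, so estimates stay consistent across levels of the hierarchy.

```python
import numpy as np

def hierarchical_masks(leaf_logits, tree):
    """Turn per-leaf mask logits into masks at every level of a source tree.

    leaf_logits: (n_leaves, freq, time) raw outputs of a (hypothetical)
    separation network, one slice per finest-level source.
    tree: maps each parent label to the indices of its leaf sources; a
    parent's mask is the sum of its children's masks, which is the
    hierarchical consistency constraint between levels.
    """
    # Softmax across leaves so the leaf masks sum to 1 at every TF bin.
    z = np.exp(leaf_logits - leaf_logits.max(axis=0, keepdims=True))
    leaf_masks = z / z.sum(axis=0, keepdims=True)
    masks = {"leaves": leaf_masks}
    for parent, children in tree.items():
        masks[parent] = leaf_masks[children].sum(axis=0)
    return masks

# Toy example: 4 leaf sources grouped under two parents.
rng = np.random.default_rng(0)
logits = rng.standard_normal((4, 64, 100))   # (leaves, freq, time)
tree = {"drums": [0, 1], "guitars": [2, 3]}
masks = hierarchical_masks(logits, tree)
# The two parent masks jointly cover the whole mixture; multiplying a
# mask with the mixture spectrogram would give that source's estimate.
assert np.allclose(masks["drums"] + masks["guitars"], 1.0)
```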
AWARD Best Paper Award at the IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) 2019 Date: December 18, 2019
Awarded to: Xuankai Chang, Wangyou Zhang, Yanmin Qian, Jonathan Le Roux, Shinji Watanabe
MERL Contact: Jonathan Le Roux
Research Areas: Artificial Intelligence, Machine Learning, Speech & Audio
Brief: MERL researcher Jonathan Le Roux and co-authors Xuankai Chang, Shinji Watanabe (Johns Hopkins University), Wangyou Zhang, and Yanmin Qian (Shanghai Jiao Tong University) won the Best Paper Award at the 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2019) for the paper "MIMO-Speech: End-to-End Multi-Channel Multi-Speaker Speech Recognition". MIMO-Speech is a fully neural end-to-end framework that can transcribe the speech of multiple speakers speaking simultaneously from multi-channel input. The system comprises a monaural masking network, a multi-source neural beamformer, and a multi-output speech recognition model, which are jointly optimized solely via an automatic speech recognition (ASR) criterion. The award was received by lead author Xuankai Chang during the conference, which was held in Sentosa, Singapore, December 14-18, 2019.
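The mask-then-beamform stage at the heart of such systems can be sketched in NumPy as a standard mask-based MVDR beamformer (a generic illustration under simplifying assumptions, not the exact MIMO-Speech formulation; the function and variable names are hypothetical): time-frequency masks weight the frames used to estimate speech and noise spatial covariance matrices, from which the beamforming filter is derived.

```python
import numpy as np

def mask_based_mvdr(stft, speech_mask, noise_mask, ref_ch=0):
    """Mask-driven MVDR beamformer for a single frequency bin.

    stft: (channels, frames) complex STFT at one frequency.
    speech_mask, noise_mask: (frames,) values in [0, 1], e.g. produced
    by a monaural masking network, weighting the frames used to
    estimate the speech and noise spatial covariance matrices.
    """
    C = stft.shape[0]
    phi_s = (speech_mask * stft) @ stft.conj().T / max(speech_mask.sum(), 1e-8)
    phi_n = (noise_mask * stft) @ stft.conj().T / max(noise_mask.sum(), 1e-8)
    phi_n += 1e-6 * np.eye(C)  # diagonal loading for numerical stability
    # MVDR solution: w = (Phi_n^-1 Phi_s / tr(Phi_n^-1 Phi_s)) u_ref
    num = np.linalg.solve(phi_n, phi_s)
    w = num[:, ref_ch] / np.trace(num)
    return w.conj() @ stft  # (frames,) enhanced single-channel STFT

# Toy example: 4 mics, 200 frames of random data at one frequency.
rng = np.random.default_rng(1)
x = rng.standard_normal((4, 200)) + 1j * rng.standard_normal((4, 200))
sm = rng.random(200)
enhanced = mask_based_mvdr(x, sm, 1.0 - sm)
```

In a multi-speaker system, one such beamformer per estimated speaker would feed a separate ASR output stream.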
AWARD Best Student Paper Award at IEEE ICASSP 2018 Date: April 17, 2018
Awarded to: Zhong-Qiu Wang
MERL Contact: Jonathan Le Roux
Research Area: Speech & Audio
Brief: Former MERL intern Zhong-Qiu Wang (Ph.D. candidate at Ohio State University) received a Best Student Paper Award at the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2018) for the paper "Multi-Channel Deep Clustering: Discriminative Spectral and Spatial Embeddings for Speaker-Independent Speech Separation" by Zhong-Qiu Wang, Jonathan Le Roux, and John Hershey. The paper presents work performed during Zhong-Qiu's internship at MERL in the summer of 2017, extending MERL's pioneering deep clustering framework for speech separation to a multi-channel setup. The award was received on behalf of Zhong-Qiu by MERL researcher and co-author Jonathan Le Roux during the conference, held in Calgary, April 15-20.
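The idea of clustering time-frequency bins using combined spectral and spatial cues can be sketched as follows (a toy NumPy illustration: in the paper, a trained network produces discriminative embeddings, whereas this sketch substitutes hand-crafted log-magnitude and inter-channel phase difference features, and the function names are hypothetical):

```python
import numpy as np

def spectral_spatial_features(stft):
    """Per-TF-bin features mixing spectral and spatial cues (a sketch).

    stft: (channels, freq, time) complex. Returns (freq*time, D): the
    log-magnitude of a reference channel plus cos/sin of each channel's
    inter-channel phase difference (IPD) to that reference.
    """
    ref = stft[0]
    logmag = np.log(np.abs(ref) + 1e-8)
    ipd = np.angle(stft[1:]) - np.angle(ref)        # (C-1, F, T)
    feats = np.concatenate([logmag[None], np.cos(ipd), np.sin(ipd)], axis=0)
    return feats.reshape(feats.shape[0], -1).T      # (F*T, 1 + 2*(C-1))

def kmeans(X, k, iters=20, seed=0):
    """Tiny k-means: assign each TF bin to one of k sources."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# Toy example: 2-channel mixture, cluster TF bins into 2 "sources".
rng = np.random.default_rng(2)
stft = rng.standard_normal((2, 32, 50)) + 1j * rng.standard_normal((2, 32, 50))
X = spectral_spatial_features(stft)
labels = kmeans(X, k=2)
```

The cluster labels then define binary masks over the time-frequency plane, one per separated speaker.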
See All Awards for Speech & Audio
News & Events
NEWS Chiori Hori will give keynote on scene understanding via multimodal sensing at AI Electronics Symposium Date: February 15, 2021
Where: The 2nd International Symposium on AI Electronics
MERL Contact: Chiori Hori
Research Areas: Artificial Intelligence, Computer Vision, Machine Learning, Speech & Audio
Brief: Chiori Hori, a Senior Principal Researcher in MERL's Speech and Audio Team, will be a keynote speaker at the 2nd International Symposium on AI Electronics, alongside Alex Acero, Senior Director of Apple Siri; Roberto Cipolla, Professor of Information Engineering at the University of Cambridge; and Hiroshi Amano, Professor at Nagoya University and winner of the Nobel Prize in Physics for his work on blue light-emitting diodes. The symposium, organized by Tohoku University, will be held online on February 15, 2021, 10am-4pm (JST).
Chiori's talk, titled "Human Perspective Scene Understanding via Multimodal Sensing", will present MERL's work towards the development of scene-aware interaction. One important capability still missing from human-machine interaction is natural, context-aware interaction, in which machines understand their surrounding scene from the human perspective and can share that understanding with humans using natural language. To bridge this communication gap, MERL has been working at the intersection of research fields such as spoken dialog, audio-visual understanding, sensor signal understanding, and robotics to build a new AI paradigm, called scene-aware interaction, that enables machines to translate their perception and understanding of a scene into natural language and respond accordingly, interacting more effectively with humans. The talk will survey these technologies and introduce an application to future car navigation.
EVENT MERL Virtual Open House 2020 Date & Time: Wednesday, December 9, 2020; 1:00-5:00PM EST
MERL Contacts: Elizabeth Phillips; Jeroen van Baar; Anthony Vetro
Location: Virtual
Research Areas: Applied Physics, Artificial Intelligence, Communications, Computational Sensing, Computer Vision, Control, Data Analytics, Dynamical Systems, Electric Systems, Electronic and Photonic Devices, Machine Learning, Multi-Physical Modeling, Optimization, Robotics, Signal Processing, Speech & Audio
Brief: MERL will host a virtual open house on December 9, 2020. Live sessions will be held from 1-5pm EST, including an overview of recent activities by our research groups and a talk by Prof. Pierre Moulin of the University of Illinois at Urbana-Champaign on adversarial machine learning. Registered attendees will also be able to browse our virtual booths at their convenience and connect with our research staff about engagement opportunities, including internship, post-doc, and research scientist openings, as well as visiting faculty positions.
Registration: https://mailchi.mp/merl/merl-virtual-open-house-2020
Schedule: https://www.merl.com/events/voh20
Current internship and employment openings:
https://www.merl.com/internship/openings
https://www.merl.com/employment/employment
Information about working at MERL:
https://www.merl.com/employment
See All News & Events for Speech & Audio
Research Highlights
Internships
SA1611: Audio source separation and sound event detection
We are seeking a graduate student interested in helping advance the fields of source separation, speech enhancement, and sound event detection/localization in challenging multi-source and far-field scenarios. The intern will collaborate with MERL researchers to derive and implement new models and optimization methods, conduct experiments, and prepare results for publication. The ideal candidate is a senior Ph.D. student with experience in audio signal processing, microphone array processing, probabilistic modeling, and deep learning techniques requiring minimal supervision (e.g., unsupervised, weakly-supervised, self-supervised, or few-shot learning). The internship will take place during fall/winter 2021, with an expected duration of 3-6 months and a flexible start date. The internship is preferably onsite at MERL, but may be performed remotely if the COVID pandemic makes it necessary.
SA1612: End-to-end speech and audio processing
MERL is looking for interns to work on fundamental research in the area of end-to-end speech and audio processing for new and challenging environments using advanced machine learning techniques. The intern will collaborate with MERL researchers to derive and implement new models and learning methods, conduct experiments, and prepare results for high-impact publication. The ideal candidates are senior Ph.D. students with experience in one or more of automatic speech recognition, speech enhancement, sound event detection, and natural language processing, including good theoretical and practical knowledge of relevant machine learning algorithms and related programming skills. The internship will take place during fall/winter 2021, with an expected duration of 3-6 months and a flexible start date. The internship is preferably onsite at MERL, but may be performed remotely if the COVID pandemic makes it necessary.
See All Internships for Speech & Audio
Openings
See All Openings at MERL
Recent Publications
- "Transformer-based Long-context End-to-end Speech Recognition", Annual Conference of the International Speech Communication Association (Interspeech), DOI: 10.21437/Interspeech.2020-2928, October 2020, pp. 5011-5015.BibTeX TR2020-139 PDF
- @inproceedings{Hori2020oct,
- author = {Hori, Takaaki and Moritz, Niko and Hori, Chiori and Le Roux, Jonathan},
- title = {Transformer-based Long-context End-to-end Speech Recognition},
- booktitle = {Annual Conference of the International Speech Communication Association (Interspeech)},
- year = 2020,
- pages = {5011--5015},
- month = oct,
- doi = {10.21437/Interspeech.2020-2928},
- issn = {1990-9772},
- url = {https://www.merl.com/publications/TR2020-139}
- }
, - "Detecting Audio Attacks on ASR Systems with Dropout Uncertainty", Annual Conference of the International Speech Communication Association (Interspeech), DOI: 10.21437/Interspeech.2020-1846, October 2020, pp. 4671-4675.BibTeX TR2020-137 PDF
- @inproceedings{Jayashankar2020oct,
- author = {Jayashankar, Tejas and Le Roux, Jonathan and Moulin, Pierre},
- title = {Detecting Audio Attacks on ASR Systems with Dropout Uncertainty},
- booktitle = {Annual Conference of the International Speech Communication Association (Interspeech)},
- year = 2020,
- pages = {4671--4675},
- month = oct,
- doi = {10.21437/Interspeech.2020-1846},
- issn = {1990-9772},
- url = {https://www.merl.com/publications/TR2020-137}
- }
, - "All-in-One Transformer: Unifying Speech Recognition, Audio Tagging, and Event Detection", Annual Conference of the International Speech Communication Association (Interspeech), DOI: 10.21437/Interspeech.2020-2757, October 2020, pp. 3112-3116.BibTeX TR2020-138 PDF
- @inproceedings{Moritz2020oct,
- author = {Moritz, Niko and Wichern, Gordon and Hori, Takaaki and Le Roux, Jonathan},
- title = {All-in-One Transformer: Unifying Speech Recognition, Audio Tagging, and Event Detection},
- booktitle = {Annual Conference of the International Speech Communication Association (Interspeech)},
- year = 2020,
- pages = {3112--3116},
- month = oct,
- doi = {10.21437/Interspeech.2020-2757},
- issn = {1990-9772},
- url = {https://www.merl.com/publications/TR2020-138}
- }
, - "Hierarchical Musical Instrument Separation", International Society for Music Information Retrieval (ISMIR) Conference, October 2020, pp. 376-383.BibTeX TR2020-136 PDF
- @inproceedings{Manilow2020oct,
- author = {Manilow, Ethan and Wichern, Gordon and Le Roux, Jonathan},
- title = {Hierarchical Musical Instrument Separation},
- booktitle = {International Society for Music Information Retrieval (ISMIR) Conference},
- year = 2020,
- pages = {376--383},
- month = oct,
- isbn = {978-0-9813537-0-8},
- url = {https://www.merl.com/publications/TR2020-136}
- }
, - "Autoclip: Adaptive Gradient Clipping For Source Separation Networks", IEEE International Workshop on Machine Learning for Signal Processing (MLSP), DOI: https://doi.org/10.1109/MLSP49062.2020.9231926, September 2020.BibTeX TR2020-132 PDF
- @inproceedings{Seetharaman2020sep,
- author = {Seetharaman, Prem and Wichern, Gordon and Pardo, Bryan and Le Roux, Jonathan},
- title = {Autoclip: Adaptive Gradient Clipping For Source Separation Networks},
- booktitle = {IEEE International Workshop on Machine Learning for Signal Processing (MLSP)},
- year = 2020,
- month = sep,
- publisher = {IEEE},
- doi = {https://doi.org/10.1109/MLSP49062.2020.9231926},
- url = {https://www.merl.com/publications/TR2020-132}
- }
, - "Finding Strength in Weakness: Learning to Separate Sounds with Weak Supervision", IEEE/ACM Transactions on Audio, Speech, and Language Processing, DOI: 10.1109/TASLP.2020.3013105, Vol. 28, pp. 2386-2399, September 2020.BibTeX TR2020-126 PDF
- @article{Pishdadian2020sep,
- author = {Pishdadian, Fatemeh and Wichern, Gordon and Le Roux, Jonathan},
- title = {Finding Strength in Weakness: Learning to Separate Sounds with Weak Supervision},
- journal = {IEEE/ACM Transactions on Audio, Speech, and Language Processing},
- year = 2020,
- volume = 28,
- pages = {2386--2399},
- month = sep,
- doi = {10.1109/TASLP.2020.3013105},
- url = {https://www.merl.com/publications/TR2020-126}
- }
, - "Bootstrapping Unsupervised Deep Music Separation from Primitive Auditory Grouping Principles", ICML 2020 Workshop on Self-supervision in Audio and Speech, July 2020.BibTeX TR2020-111 PDF
- @inproceedings{Seetharaman2020jul,
- author = {Seetharaman, Prem and Wichern, Gordon and Le Roux, Jonathan and Pardo, Bryan},
- title = {Bootstrapping Unsupervised Deep Music Separation from Primitive Auditory Grouping Principles},
- booktitle = {ICML 2020 Workshop on Self-supervision in Audio and Speech},
- year = 2020,
- month = jul,
- url = {https://www.merl.com/publications/TR2020-111}
- }
, - "End-To-End Multi-Speaker Speech Recognition with Transformer", IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), DOI: 10.1109/ICASSP40776.2020.9054029, April 2020, pp. 6134-6138.BibTeX TR2020-043 PDF Video
- @inproceedings{Chang2020apr,
- author = {Chang, Xuankai and Zhang, Wangyou and Qian, Yanmin and Le Roux, Jonathan and Watanabe, Shinji},
- title = {End-To-End Multi-Speaker Speech Recognition with Transformer},
- booktitle = {IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)},
- year = 2020,
- pages = {6134--6138},
- month = apr,
- publisher = {IEEE},
- doi = {10.1109/ICASSP40776.2020.9054029},
- issn = {2379-190X},
- isbn = {978-1-5090-6631-5},
- url = {https://www.merl.com/publications/TR2020-043}
- }
Videos
Software Downloads