Speech & Audio

Audio source separation, recognition, and understanding.

Our current research focuses on the application of machine learning to estimation and inference problems in speech and audio processing. Topics include end-to-end speech recognition and enhancement, acoustic modeling and analysis, statistical dialog systems, natural language understanding, and adaptive multimodal interfaces.

  • Researchers

  • Awards

    •  AWARD   Best Paper Award at the IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) 2019
      Date: December 18, 2019
      Awarded to: Xuankai Chang, Wangyou Zhang, Yanmin Qian, Jonathan Le Roux, Shinji Watanabe
      MERL Contact: Jonathan Le Roux
      Research Areas: Artificial Intelligence, Machine Learning, Speech & Audio
      Brief
      • MERL researcher Jonathan Le Roux and co-authors Xuankai Chang, Shinji Watanabe (Johns Hopkins University), Wangyou Zhang, and Yanmin Qian (Shanghai Jiao Tong University) won the Best Paper Award at the 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2019) for the paper "MIMO-Speech: End-to-End Multi-Channel Multi-Speaker Speech Recognition". MIMO-Speech is a fully neural end-to-end framework that can transcribe the text of multiple speakers speaking simultaneously from multi-channel input. The system consists of a monaural masking network, a multi-source neural beamformer, and a multi-output speech recognition model, which are jointly optimized solely via an automatic speech recognition (ASR) criterion. The award was received by lead author Xuankai Chang during the conference, which was held in Sentosa, Singapore, from December 14-18, 2019.
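        For illustration, the three-module pipeline described above can be summarized as a single differentiable model trained only through the ASR objective. The following PyTorch-style sketch is schematic: the layer sizes, the simplified mask-weighted beamforming step, and the CTC-style output layer are assumptions, not the authors' released implementation.

        # Minimal sketch of a MIMO-Speech-style pipeline (illustrative assumptions throughout).
        import torch
        import torch.nn as nn

        class MIMOSpeechSketch(nn.Module):
            def __init__(self, n_freq=257, n_spk=2, vocab=30):
                super().__init__()
                # Monaural masking network: one time-frequency mask per target speaker.
                self.mask_net = nn.Sequential(
                    nn.Linear(n_freq, 512), nn.ReLU(),
                    nn.Linear(512, n_freq * n_spk), nn.Sigmoid())
                # Multi-output ASR model: shared encoder with a CTC-style output layer.
                self.encoder = nn.LSTM(n_freq, 256, batch_first=True)
                self.ctc_out = nn.Linear(256, vocab)
                self.n_spk = n_spk

            def forward(self, mag):                        # mag: (batch, channels, time, freq) magnitudes
                b, c, t, f = mag.shape
                masks = self.mask_net(mag).view(b, c, t, self.n_spk, f)
                # Stand-in for the multi-source neural beamformer: mask-weighted averaging
                # over channels (a real system would estimate MVDR filters from the masks).
                separated = (masks * mag.unsqueeze(3)).mean(dim=1)      # (b, t, n_spk, f)
                logits = [self.ctc_out(self.encoder(separated[:, :, s, :])[0])
                          for s in range(self.n_spk)]                   # one output stream per speaker
                return logits

        # Training would back-propagate an ASR loss (e.g. CTC, with permutation handling)
        # summed over the speaker outputs through all three modules jointly.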
    •  AWARD   Best Student Paper Award at IEEE ICASSP 2018
      Date: April 17, 2018
      Awarded to: Zhong-Qiu Wang
      MERL Contact: Jonathan Le Roux
      Research Area: Speech & Audio
      Brief
      • Former MERL intern Zhong-Qiu Wang (Ph.D. Candidate at Ohio State University) has received a Best Student Paper Award at the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2018) for the paper "Multi-Channel Deep Clustering: Discriminative Spectral and Spatial Embeddings for Speaker-Independent Speech Separation" by Zhong-Qiu Wang, Jonathan Le Roux, and John Hershey. The paper presents work performed during Zhong-Qiu's internship at MERL in the summer of 2017, extending MERL's pioneering Deep Clustering framework for speech separation to a multi-channel setup. The award was received on behalf of Zhong-Qiu by MERL researcher and co-author Jonathan Le Roux during the conference, held in Calgary, April 15-20, 2018.
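        Deep clustering, which the awarded paper extends with spatial features, learns an embedding for every time-frequency bin so that bins dominated by the same speaker cluster together; separation masks are then obtained by k-means on the embeddings. The sketch below shows the standard training objective; the toy network, feature choice, and dimensions are illustrative assumptions, not the paper's configuration.

        # Deep clustering objective ||V V^T - Y Y^T||_F^2, computed without forming N x N matrices.
        import torch
        import torch.nn as nn

        def deep_clustering_loss(V, Y):
            """V: (N, D) unit-norm embeddings for N time-frequency bins.
            Y: (N, S) one-hot assignment of each bin to its dominant speaker."""
            return (torch.norm(V.T @ V) ** 2
                    - 2 * torch.norm(V.T @ Y) ** 2
                    + torch.norm(Y.T @ Y) ** 2)

        # Toy embedding network: spectral (log-magnitude) and spatial (inter-channel phase
        # difference) features stacked per frame, in the spirit of the multi-channel extension.
        n_freq, emb_dim, n_spk = 129, 20, 2
        net = nn.Sequential(nn.Linear(2 * n_freq, 300), nn.ReLU(),
                            nn.Linear(300, n_freq * emb_dim))

        feats = torch.randn(50, 2 * n_freq)               # 50 frames of stacked features
        V = net(feats).reshape(-1, emb_dim)                # one embedding per T-F bin
        V = V / (V.norm(dim=1, keepdim=True) + 1e-8)
        Y = nn.functional.one_hot(torch.randint(n_spk, (V.shape[0],)), n_spk).float()
        loss = deep_clustering_loss(V, Y)                  # minimized over the training set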
    •  AWARD   MERL's Speech Team Achieves World's 2nd Best Performance at the Third CHiME Speech Separation and Recognition Challenge
      Date: December 15, 2015
      Awarded to: John R. Hershey, Takaaki Hori, Jonathan Le Roux and Shinji Watanabe
      MERL Contacts: Takaaki Hori; Jonathan Le Roux
      Research Area: Speech & Audio
      Brief
      • The results of the third 'CHiME' Speech Separation and Recognition Challenge were publicly announced on December 15 at the IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2015), held in Scottsdale, Arizona, USA. MERL's Speech and Audio Team, in collaboration with SRI, ranked 2nd out of 26 teams from Europe, Asia, and the US. The task this year was to recognize speech recorded using a tablet in real environments such as cafes, buses, or busy streets. Due to the high levels of noise and the distance from the speaker's mouth to the microphones, this is a very challenging task, on which the baseline system achieved only a 33.4% word error rate. The MERL/SRI system featured state-of-the-art techniques including a multi-channel front-end, noise-robust feature extraction, and deep learning for speech enhancement, acoustic modeling, and language modeling, leading to a dramatic 73% relative reduction in word error rate, down to 9.1%. The core of the system has since been released as a new official challenge baseline for the community to use.
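        The 73% figure is a reduction relative to the baseline error rate, as the quick check below illustrates.

        # Relative word-error-rate reduction from the 33.4% baseline to the 9.1% MERL/SRI system.
        baseline_wer, system_wer = 33.4, 9.1
        relative_reduction = 100 * (baseline_wer - system_wer) / baseline_wer
        print(f"{relative_reduction:.1f}% relative WER reduction")   # -> 72.8%, i.e. about 73%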

    See All Awards for Speech & Audio
  • News & Events

    •  NEWS   Anoop Cherian gave an invited talk at the Multi-modal Video Analysis Workshop, ECCV 2020
      Date: August 23, 2020
      Where: European Conference on Computer Vision (ECCV), online, 2020
      MERL Contact: Anoop Cherian
      Research Areas: Artificial Intelligence, Computer Vision, Machine Learning, Speech & Audio
      Brief
      • MERL Principal Research Scientist Anoop Cherian gave an invited talk titled "Sound2Sight: Audio-Conditioned Visual Imagination" at the Multi-modal Video Analysis workshop held in conjunction with the European Conference on Computer Vision (ECCV), 2020. The talk was based on a recent ECCV paper that describes a new multimodal reasoning task called Sound2Sight and a generative adversarial machine learning algorithm for producing plausible video sequences conditioned on sound and visual context.
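        As a rough schematic of the task, a generator produces future frames from audio and the visual context of already-seen frames, while a discriminator judges their plausibility. The sketch below is only a generic conditional-GAN skeleton; all module choices, feature sizes, and the single-frame output are assumptions, not the ECCV paper's architecture.

        # Generic conditional-GAN skeleton for audio-conditioned frame generation (illustrative only).
        import torch
        import torch.nn as nn

        audio_dim, frame_dim, z_dim = 128, 1024, 64        # assumed feature sizes

        generator = nn.Sequential(                         # (noise, audio, visual context) -> next frame features
            nn.Linear(z_dim + audio_dim + frame_dim, 512), nn.ReLU(),
            nn.Linear(512, frame_dim), nn.Tanh())
        discriminator = nn.Sequential(                     # (frame features, audio) -> real/fake score
            nn.Linear(frame_dim + audio_dim, 512), nn.LeakyReLU(0.2),
            nn.Linear(512, 1))

        audio = torch.randn(8, audio_dim)                  # per-step audio embedding
        context = torch.randn(8, frame_dim)                # encoding of the frames seen so far
        z = torch.randn(8, z_dim)                          # noise allowing diverse "imaginations"
        fake_frame = generator(torch.cat([z, audio, context], dim=1))
        score = discriminator(torch.cat([fake_frame, audio], dim=1))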
    •  NEWS   MERL's Scene-Aware Interaction Technology Featured in Mitsubishi Electric Corporation Press Release
      Date: July 22, 2020
      Where: Tokyo, Japan
      MERL Contacts: Siheng Chen; Anoop Cherian; Bret Harsham; Chiori Hori; Takaaki Hori; Jonathan Le Roux; Tim Marks; Alan Sullivan; Anthony Vetro
      Research Areas: Artificial Intelligence, Computer Vision, Machine Learning, Speech & Audio
      Brief
      • Mitsubishi Electric Corporation announced that the company has developed what it believes to be the world’s first technology capable of highly natural and intuitive interaction with humans based on a scene-aware capability to translate multimodal sensing information into natural language.

        The novel technology, Scene-Aware Interaction, incorporates Mitsubishi Electric’s proprietary Maisart® compact AI technology to analyze multimodal sensing information for highly natural and intuitive interaction with humans through context-dependent generation of natural language. The technology recognizes contextual objects and events based on multimodal sensing information, such as images and video captured with cameras, audio information recorded with microphones, and localization information measured with LiDAR.

        Scene-Aware Interaction for car navigation, one target application, will provide drivers with intuitive route guidance. The technology is also expected to have applicability to human-machine interfaces for in-vehicle infotainment, interaction with service robots in building and factory automation systems, systems that monitor the health and well-being of people, surveillance systems that interpret complex scenes for humans and encourage social distancing, support for touchless operation of equipment in public areas, and much more. The technology is based on recent research by MERL's Speech & Audio and Computer Vision groups.
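        At a schematic level, the sensing-to-language idea described above amounts to encoding each modality, fusing the encodings, and decoding a natural-language description. The sketch below is only a generic illustration of that pattern; the encoders, attention-based fusion, and feature sizes are assumptions, not Mitsubishi Electric's implementation.

        # Generic multimodal encode-fuse-describe skeleton (illustrative assumptions throughout).
        import torch
        import torch.nn as nn

        d = 256                                            # shared feature size (assumed)
        video_enc = nn.Linear(2048, d)                     # e.g. pooled CNN features for a clip
        audio_enc = nn.Linear(128, d)                      # e.g. a log-mel audio embedding
        lidar_enc = nn.Linear(64, d)                       # e.g. localization/occupancy features
        fuse = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        decoder = nn.GRU(d, d, batch_first=True)
        to_vocab = nn.Linear(d, 10000)                     # word logits

        video, audio, lidar = torch.randn(1, 2048), torch.randn(1, 128), torch.randn(1, 64)
        mods = torch.stack([video_enc(video), audio_enc(audio), lidar_enc(lidar)], dim=1)  # (1, 3, d)
        ctx, _ = fuse(mods, mods, mods)                    # cross-modal fusion
        out, _ = decoder(ctx.mean(dim=1, keepdim=True))    # first decoding step from the fused context
        next_word_logits = to_vocab(out)                   # full autoregressive generation omitted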


        Demonstration Video:

        Link: Mitsubishi Electric Corporation Press Release

    See All News & Events for Speech & Audio
  • Research Highlights

  • Internships

    • SA1464: Joint localization and classification of sound events

      We are seeking a graduate student interested in helping advance the field of multi-channel sound localization and classification using acoustic sensor networks in challenging multi-source and far-field scenarios. The intern will collaborate with MERL researchers to derive and implement new models and optimization methods, conduct experiments, and prepare results for publication. The ideal candidate would be a senior Ph.D. student with experience in audio signal processing, beamforming/array processing, probabilistic modeling, and deep learning. The internship will be performed remotely, and candidates from both within and outside the US are welcome to apply. The expected duration of the (virtual) internship is 3-6 months with a start date between Fall 2020 and early 2021.
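      As a concrete entry point to the localization half of this topic, the sketch below shows GCC-PHAT time-difference-of-arrival estimation between two microphones, a classic baseline for array-based localization. It is not project code; the signal, sampling rate, and delay are synthetic.

      # GCC-PHAT delay estimation between two microphone signals (NumPy sketch).
      import numpy as np

      def gcc_phat(x, y, fs, max_tau=None):
          """Estimate the delay of y relative to x via the phase transform."""
          n = 2 * max(len(x), len(y))
          X, Y = np.fft.rfft(x, n), np.fft.rfft(y, n)
          R = Y * np.conj(X)                              # cross-spectrum
          cc = np.fft.irfft(R / (np.abs(R) + 1e-12), n)   # PHAT-weighted cross-correlation
          max_shift = n // 2 if max_tau is None else min(int(fs * max_tau), n // 2)
          cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
          return (np.argmax(np.abs(cc)) - max_shift) / fs  # delay in seconds

      fs, delay = 16000, 40                               # 16 kHz, 40-sample (2.5 ms) delay
      x = np.random.randn(fs)                             # 1 s of noise as a test source
      y = np.concatenate([np.zeros(delay), x])[:len(x)]   # second mic hears x delayed
      print(gcc_phat(x, y, fs))                           # -> 0.0025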


    See All Internships for Speech & Audio
  • Recent Publications

    •  Pishdadian, F., Wichern, G., Le Roux, J., "Finding Strength in Weakness: Learning to Separate Sounds with Weak Supervision", IEEE/ACM Transactions on Audio, Speech, and Language Processing, September 2020.
      BibTeX TR2020-126 PDF
      • @article{Pishdadian2020sep,
      • author = {Pishdadian, Fatemeh and Wichern, Gordon and Le Roux, Jonathan},
      • title = {Finding Strength in Weakness: Learning to Separate Sounds with Weak Supervision},
      • journal = {IEEE/ACM Transactions on Audio, Speech, and Language Processing},
      • year = 2020,
      • month = sep,
      • url = {https://www.merl.com/publications/TR2020-126}
      • }
    •  Seetharaman, P., Wichern, G., Le Roux, J., Pardo, B., "Bootstrapping Unsupervised Deep Music Separation from Primitive Auditory Grouping Principles", ICML 2020 Workshop on Self-supervision in Audio and Speech, July 2020.
      BibTeX TR2020-111 PDF
      • @inproceedings{Seetharaman2020jul,
      • author = {Seetharaman, Prem and Wichern, Gordon and Le Roux, Jonathan and Pardo, Bryan},
      • title = {Bootstrapping Unsupervised Deep Music Separation from Primitive Auditory Grouping Principles},
      • booktitle = {ICML 2020 Workshop on Self-supervision in Audio and Speech},
      • year = 2020,
      • month = jul,
      • url = {https://www.merl.com/publications/TR2020-111}
      • }
    •  Chang, X., Zhang, W., Qian, Y., Le Roux, J., Watanabe, S., "End-To-End Multi-Speaker Speech Recognition with Transformer", IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), DOI: 10.1109/ICASSP40776.2020.9054029, April 2020, pp. 6134-6138.
      BibTeX TR2020-043 PDF Video
      • @inproceedings{Chang2020apr,
      • author = {Chang, Xuankai and Zhang, Wangyou and Qian, Yanmin and Le Roux, Jonathan and Watanabe, Shinji},
      • title = {End-To-End Multi-Speaker Speech Recognition with Transformer},
      • booktitle = {IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)},
      • year = 2020,
      • pages = {6134--6138},
      • month = apr,
      • publisher = {IEEE},
      • doi = {10.1109/ICASSP40776.2020.9054029},
      • issn = {2379-190X},
      • isbn = {978-1-5090-6631-5},
      • url = {https://www.merl.com/publications/TR2020-043}
      • }
    •  Maciejewski, M., Wichern, G., McQuinn, E., Le Roux, J., "WHAMR!: Noisy and Reverberant Single-Channel Speech Separation", IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), DOI: 10.1109/ICASSP40776.2020.9053327, April 2020, pp. 696-700.
      BibTeX TR2020-042 PDF Video
      • @inproceedings{Maciejewski2020apr,
      • author = {Maciejewski, Matthew and Wichern, Gordon and McQuinn, Emmett and Le Roux, Jonathan},
      • title = {WHAMR!: Noisy and Reverberant Single-Channel Speech Separation},
      • booktitle = {IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)},
      • year = 2020,
      • pages = {696--700},
      • month = apr,
      • publisher = {IEEE},
      • doi = {10.1109/ICASSP40776.2020.9053327},
      • issn = {2379-190X},
      • isbn = {978-1-5090-6631-5},
      • url = {https://www.merl.com/publications/TR2020-042}
      • }
    •  Moritz, N., Hori, T., Le Roux, J., "Streaming Automatic Speech Recognition With The Transformer Model", IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), DOI: 10.1109/ICASSP40776.2020.9054476, April 2020, pp. 6074-6078.
      BibTeX TR2020-040 PDF Video
      • @inproceedings{Moritz2020apr,
      • author = {Moritz, Niko and Hori, Takaaki and Le Roux, Jonathan},
      • title = {Streaming Automatic Speech Recognition With The Transformer Model},
      • booktitle = {IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)},
      • year = 2020,
      • pages = {6074--6078},
      • month = apr,
      • publisher = {IEEE},
      • doi = {10.1109/ICASSP40776.2020.9054476},
      • issn = {2379-190X},
      • isbn = {978-1-5090-6631-5},
      • url = {https://www.merl.com/publications/TR2020-040}
      • }
    •  Pishdadian, F., Wichern, G., Le Roux, J., "Learning to Separate Sounds From Weakly Labeled Scenes", IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), DOI: 10.1109/ICASSP40776.2020.9053055, April 2020, pp. 91-95.
      BibTeX TR2020-038 PDF Video
      • @inproceedings{Pishdadian2020apr,
      • author = {Pishdadian, Fatemeh and Wichern, Gordon and Le Roux, Jonathan},
      • title = {Learning to Separate Sounds From Weakly Labeled Scenes},
      • booktitle = {IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)},
      • year = 2020,
      • pages = {91--95},
      • month = apr,
      • publisher = {IEEE},
      • doi = {10.1109/ICASSP40776.2020.9053055},
      • issn = {2379-190X},
      • isbn = {978-1-5090-6631-5},
      • url = {https://www.merl.com/publications/TR2020-038}
      • }
    •  Sari, L., Moritz, N., Hori, T., Le Roux, J., "Unsupervised Speaker Adaptation Using Attention-Based Speaker Memory For End-To-End ASR", IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), DOI: 10.1109/ICASSP40776.2020.9054249, April 2020, pp. 7384-7388.
      BibTeX TR2020-037 PDF Video
      • @inproceedings{Sari2020apr,
      • author = {Sari, Leda and Moritz, Niko and Hori, Takaaki and Le Roux, Jonathan},
      • title = {Unsupervised Speaker Adaptation Using Attention-Based Speaker Memory For End-To-End ASR},
      • booktitle = {IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)},
      • year = 2020,
      • pages = {7384--7388},
      • month = apr,
      • publisher = {IEEE},
      • doi = {10.1109/ICASSP40776.2020.9054249},
      • issn = {2379-190X},
      • isbn = {978-1-5090-6631-5},
      • url = {https://www.merl.com/publications/TR2020-037}
      • }
    •  Shi, L., Geng, S., Shuang, K., Hori, C., Liu, S., Gao, P., Su, S., "Multi-Layer Content Interaction Through Quaternion Product For Visual Question Answering", IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), DOI: 10.1109/ICASSP40776.2020.9053595, April 2020, pp. 4412-4416.
      BibTeX TR2020-046 PDF
      • @inproceedings{Shi2020apr,
      • author = {Shi, Lei and Geng, Shijie and Shuang, Kai and Hori, Chiori and Liu, Songxiang and Gao, Peng and Su, Sen},
      • title = {Multi-Layer Content Interaction Through Quaternion Product For Visual Question Answering},
      • booktitle = {IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)},
      • year = 2020,
      • pages = {4412--4416},
      • month = apr,
      • publisher = {IEEE},
      • doi = {10.1109/ICASSP40776.2020.9053595},
      • issn = {2379-190X},
      • isbn = {978-1-5090-6631-5},
      • url = {https://www.merl.com/publications/TR2020-046}
      • }
    See All Publications for Speech & Audio
  • Videos

  • Software Downloads