Speech & Audio

Audio source separation, recognition, and understanding.

Our current research focuses on the application of machine learning to estimation and inference problems in speech and audio processing. Topics include end-to-end speech recognition and enhancement, acoustic modeling and analysis, statistical dialog systems, natural language understanding, and adaptive multimodal interfaces.

  • Researchers

  • Awards

    •  AWARD    MERL Intern and Researchers Win ICASSP 2023 Best Student Paper Award
      Date: June 9, 2023
      Awarded to: Darius Petermann, Gordon Wichern, Aswin Subramanian, Jonathan Le Roux
      MERL Contacts: Jonathan Le Roux; Gordon Wichern
      Research Areas: Artificial Intelligence, Machine Learning, Speech & Audio
      Brief
      • Former MERL intern Darius Petermann (Ph.D. candidate at Indiana University) has received a Best Student Paper Award at the 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2023) for the paper "Hyperbolic Audio Source Separation", co-authored with MERL researchers Gordon Wichern and Jonathan Le Roux and former MERL researcher Aswin Subramanian. The paper presents work performed during Darius's internship at MERL in the summer of 2022. It introduces a framework for audio source separation using embeddings on a hyperbolic manifold that compactly represent the hierarchical relationship between sound sources and time-frequency features; a brief illustrative sketch of the underlying geometry appears after this item. The code associated with the paper is publicly available at https://github.com/merlresearch/hyper-unmix.

        ICASSP is the flagship conference of the IEEE Signal Processing Society (SPS). ICASSP 2023 was held on the Greek island of Rhodes from June 4 to June 10, 2023, and was the largest ICASSP in history, with more than 4000 participants, 6128 submitted papers, and 2709 accepted papers. Darius’s paper was first recognized as being among the top 3% of all papers accepted at the conference, before receiving one of only five Best Student Paper Awards during the closing ceremony.
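
        The following is a minimal, hypothetical PyTorch sketch of the geometry the paper builds on, not the authors' implementation (the official code lives at https://github.com/merlresearch/hyper-unmix). It maps Euclidean embeddings into the Poincaré ball via the exponential map at the origin and compares points with the hyperbolic geodesic distance, under which broad, mixture-like sources can sit near the origin and specific sources near the boundary.

            import torch

            def expmap0(v: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
                """Exponential map at the origin of the (curvature -1) Poincare ball."""
                norm = v.norm(dim=-1, keepdim=True).clamp_min(eps)
                return torch.tanh(norm) * v / norm

            def poincare_distance(x: torch.Tensor, y: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
                """Geodesic distance between points inside the unit ball."""
                sq = (x - y).pow(2).sum(-1)
                denom = (1 - x.pow(2).sum(-1)) * (1 - y.pow(2).sum(-1))
                return torch.acosh(1 + 2 * sq / denom.clamp_min(eps))

            # Hypothetical example: embed 64 time-frequency bins, then compare each
            # bin to 4 source anchors living in the same 2-D ball.
            tf_embeddings = expmap0(torch.randn(64, 2))
            anchors = expmap0(0.5 * torch.randn(4, 2))
            dists = poincare_distance(tf_embeddings[:, None, :], anchors[None, :, :])
            print(dists.shape)  # torch.Size([64, 4])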
    •  AWARD    Joint CMU-MERL team wins DCASE2023 Challenge on Automated Audio Captioning
      Date: June 1, 2023
      Awarded to: Shih-Lun Wu, Xuankai Chang, Gordon Wichern, Jee-weon Jung, Francois Germain, Jonathan Le Roux, Shinji Watanabe
      MERL Contacts: Francois Germain; Jonathan Le Roux; Gordon Wichern
      Research Areas: Artificial Intelligence, Machine Learning, Speech & Audio
      Brief
      • A joint team, consisting of members of the WavLab of CMU Professor and MERL alumnus Shinji Watanabe and members of MERL's Speech & Audio team, ranked 1st out of 11 teams in the DCASE2023 Challenge's Task 6A, "Automated Audio Captioning". The team was led by student Shih-Lun Wu and also featured Ph.D. candidate Xuankai Chang, postdoctoral research associate Jee-weon Jung, Prof. Shinji Watanabe, and MERL researchers Gordon Wichern, Francois Germain, and Jonathan Le Roux.

        The IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE Challenge), first held in 2013 and organized yearly since 2016, gathers challenges on multiple tasks related to the detection, analysis, and generation of sound events. This year, the DCASE2023 Challenge received 428 submissions from 123 teams across seven tasks.

        The CMU-MERL team competed in the Task 6A track, Automated Audio Captioning, which aims at generating informative descriptions for various sounds from nature and/or human activities. The team's system made strong use of large pretrained models: a BEATs transformer as part of the audio encoder stack, an Instructor Transformer encoding ground-truth captions to derive an audio-text contrastive loss on the audio encoder (sketched after this item), and ChatGPT to produce caption mix-ups (i.e., grammatical and compact combinations of two captions) which, together with the corresponding audio mixtures, increase not only the amount but also the complexity and diversity of the training data. The team's best submission obtained a SPIDEr-FL score of 0.327 on the hidden test set, largely outperforming the second-best team's score of 0.315.
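
        As a hedged illustration of the audio-text contrastive objective mentioned above, the sketch below assumes pooled audio-encoder outputs and per-caption text embeddings (the team used Instructor Transformer embeddings; random placeholders stand in for both here) and applies a symmetric InfoNCE loss in which matching (audio, caption) pairs are positives. This is illustrative only, not the team's code.

            import torch
            import torch.nn.functional as F

            def audio_text_contrastive_loss(audio_emb, text_emb, temperature=0.07):
                """Symmetric InfoNCE: the i-th audio clip matches the i-th caption."""
                a = F.normalize(audio_emb, dim=-1)
                t = F.normalize(text_emb, dim=-1)
                logits = a @ t.T / temperature          # (batch, batch) cosine similarities
                targets = torch.arange(a.size(0))       # positives lie on the diagonal
                return 0.5 * (F.cross_entropy(logits, targets)
                              + F.cross_entropy(logits.T, targets))

            batch, dim = 8, 768
            audio_emb = torch.randn(batch, dim)  # stand-in for pooled audio encoder states
            text_emb = torch.randn(batch, dim)   # stand-in for Instructor caption embeddings
            print(audio_text_contrastive_loss(audio_emb, text_emb))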
    •  AWARD    Best Poster Award and Best Video Award at the International Society for Music Information Retrieval Conference (ISMIR) 2020
      Date: October 15, 2020
      Awarded to: Ethan Manilow, Gordon Wichern, Jonathan Le Roux
      MERL Contacts: Jonathan Le Roux; Gordon Wichern
      Research Areas: Artificial Intelligence, Machine Learning, Speech & Audio
      Brief
      • Former MERL intern Ethan Manilow and MERL researchers Gordon Wichern and Jonathan Le Roux won a Best Poster Award and a Best Video Award at the 2020 International Society for Music Information Retrieval Conference (ISMIR 2020) for the paper "Hierarchical Musical Source Separation". The conference was held October 11-14, 2020, in a virtual format. The Best Poster and Best Video Awards were determined by popular vote among conference attendees.

        The paper proposes a new method for isolating individual sounds in an audio mixture that accounts for the hierarchical relationship between sound sources. Many sounds we are interested in analyzing are hierarchical in nature: during a music performance, for example, a hi-hat note is one of many such hi-hat notes, which together form one of several parts of a drum kit, itself one of many instruments in a band, which might be playing in a bar with other sounds occurring. Inspired by this, the paper re-frames the audio source separation problem as hierarchical, combining similar sounds together at certain levels while separating them at other levels, and shows on a musical instrument separation task that a hierarchical approach outperforms non-hierarchical models while requiring less training data; one simple way to express such a constraint is sketched after this item. The paper, poster, and video are available on the paper's page on the ISMIR website.
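
        The sketch below illustrates the hierarchy idea under the assumption that a parent-level estimate (e.g., the whole drum kit) should equal the sum of its children's estimates, so reconstruction losses can be summed across levels of a source tree. The tree and loss are hypothetical illustrations, not the paper's model.

            import torch
            import torch.nn.functional as F

            # Hypothetical source hierarchy: parents map to their leaf children.
            TREE = {"drums": ["hi-hat", "kick", "snare"], "strings": ["guitar", "bass"]}

            def hierarchical_loss(estimates, references):
                """L1 loss at the leaf level plus L1 loss at each parent level."""
                loss = sum(F.l1_loss(estimates[s], references[s]) for s in estimates)
                for parent, children in TREE.items():
                    est = sum(estimates[c] for c in children)   # parent = sum of children
                    ref = sum(references[c] for c in children)
                    loss = loss + F.l1_loss(est, ref)
                return loss

            leaves = ["hi-hat", "kick", "snare", "guitar", "bass"]
            est = {s: torch.randn(1, 16000) for s in leaves}  # 1 s of audio at 16 kHz
            ref = {s: torch.randn(1, 16000) for s in leaves}
            print(hierarchical_loss(est, ref))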

    See All Awards for Speech & Audio
  • News & Events

    •  NEWS    MERL co-organizes the 2023 Sound Demixing (SDX2023) Challenge and Workshop
      Date: January 23, 2023 - November 4, 2023
      Where: International Society for Music Information Retrieval Conference (ISMIR)
      MERL Contacts: Jonathan Le Roux; Gordon Wichern
      Research Areas: Artificial Intelligence, Machine Learning, Speech & Audio
      Brief
      • MERL Speech & Audio team members Gordon Wichern and Jonathan Le Roux co-organized the 2023 Sound Demixing Challenge along with researchers from Sony, Moises AI, Audioshake, and Meta.

        The SDX2023 Challenge was hosted on the AIcrowd platform and had a prize pool of $42,000 distributed to the winning teams across two tracks: Music Demixing and Cinematic Sound Demixing. A unique aspect of this challenge was the ability to test the audio source separation models developed by challenge participants on non-public songs from Sony Music Entertainment Japan for the music demixing track, and on movie soundtracks from Sony Pictures for the cinematic sound demixing track. The challenge ran from January 23 to May 1, 2023, and drew 884 participants across 68 teams, who submitted 2828 source separation models. The winners will be announced at the SDX2023 Workshop, which will take place as a satellite event of the International Society for Music Information Retrieval Conference (ISMIR) in Milan, Italy, on November 4, 2023.

        MERL’s contribution to SDX2023 focused mainly on the cinematic demixing track. In addition to sponsoring the prizes awarded to the winning teams of that track, MERL provided the baseline system and initial training data, namely its Cocktail Fork separation model and Divide and Remaster dataset, respectively. MERL researchers also contributed to a Town Hall kicking off the challenge, co-authored a scientific paper describing the challenge outcomes, and co-organized the SDX2023 Workshop.
    •  TALK    [MERL Seminar Series 2023] Prof. Komei Sugiura presents talk titled The Confluence of Vision, Language, and Robotics
      Date & Time: Thursday, September 28, 2023; 12:00 PM
      Speaker: Komei Sugiura, Keio University
      MERL Host: Chiori Hori
      Research Areas: Artificial Intelligence, Machine Learning, Robotics, Speech & Audio
      Abstract
      • Recent advances in multimodal models that fuse vision and language are revolutionizing robotics. In this lecture, I will begin by introducing recent multimodal foundation models and their applications in robotics. The second part of this talk will address our recent work on multimodal language processing in robotics. The shortage of home care workers has become a pressing societal issue, and the use of domestic service robots (DSRs) to assist individuals with disabilities is seen as a possible solution. I will present our work on DSRs that are capable of open-vocabulary mobile manipulation, referring expression comprehension and segmentation models for everyday objects, and future captioning methods for cooking videos and DSRs.

    See All News & Events for Speech & Audio
  • Research Highlights

  • Internships

    • SA2074: Audio source separation and generation

      We are seeking graduate students interested in helping advance the fields of generative audio, source separation, speech enhancement, and robust ASR in challenging multi-source and far-field scenarios. The interns will collaborate with MERL researchers to derive and implement new models and optimization methods, conduct experiments, and prepare results for publication. Internships regularly lead to one or more publications in top-tier venues, which can later become part of the intern's doctoral work. The ideal candidates are senior Ph.D. students with experience in some of the following: audio signal processing, microphone array processing, probabilistic modeling, sequence-to-sequence models, and generative modeling techniques, in particular those involving minimal supervision (e.g., unsupervised, weakly-supervised, self-supervised, or few-shot learning). Multiple positions are available with flexible start dates (not just Spring/Summer but throughout 2024) and durations (typically 3-6 months).

    • SA2067: Sound event and anomaly detection

      We are seeking graduate students interested in helping advance the fields of sound event detection/localization and sound anomaly detection. The interns will collaborate with MERL researchers to derive and implement new models and optimization methods, conduct experiments, and prepare results for publication. Internships regularly lead to one or more publications in top-tier venues, which may later become part of the intern's doctoral work. The ideal candidates are senior Ph.D. students with experience in some of the following: audio signal processing, microphone array processing, probabilistic modeling, sequence-to-sequence models, and deep learning techniques, in particular those involving minimal supervision (e.g., unsupervised, weakly-supervised, self-supervised, or few-shot learning). Multiple positions are available with flexible start dates (not just Spring/Summer but throughout 2024) and durations (typically 3-6 months).

    • SA2073: Multimodal scene understanding

      We are looking for graduate students interested in helping advance the field of multimodal scene understanding, with a focus on scene understanding using natural language for robot dialog and/or indoor monitoring using a large language model. The interns will collaborate with MERL researchers to derive and implement new models and optimization methods, conduct experiments, and prepare results for publication. Internships regularly lead to one or more publications in top-tier venues, which can later become part of the intern's doctoral work. The ideal candidates are senior Ph.D. students with experience in deep learning for audio-visual, signal, and natural language processing. Good programming skills in Python and knowledge of deep learning frameworks such as PyTorch are essential. Multiple positions are available with flexible start dates (not just Spring/Summer but throughout 2024) and durations (typically 3-6 months).


    See All Internships for Speech & Audio
  • Recent Publications

    •  Falcon Perez, R., Wichern, G., Germain, F., Le Roux, J., "Location as supervision for weakly supervised multi-channel source separation of machine sounds", IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), DOI: 10.1109/WASPAA58266.2023.10248128, September 2023.
      BibTeX TR2023-119 PDF Presentation
      • @inproceedings{FalconPerez2023aug,
      • author = {Falcon Perez, Ricardo and Wichern, Gordon and Germain, Francois and Le Roux, Jonathan},
      • title = {Location as supervision for weakly supervised multi-channel source separation of machine sounds},
      • booktitle = {IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)},
      • year = 2023,
      • month = sep,
      • publisher = {IEEE},
      • doi = {10.1109/WASPAA58266.2023.10248128},
      • issn = {1947-1629},
      • isbn = {979-8-3503-2372-6},
      • url = {https://www.merl.com/publications/TR2023-119}
      • }
    •  Germain, F., Wichern, G., Le Roux, J., "Hyperbolic Unsupervised Anomalous Sound Detection", IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), DOI: 10.1109/WASPAA58266.2023.10248092, September 2023.
      BibTeX TR2023-108 PDF Video Presentation
      • @inproceedings{Germain2023aug,
      • author = {Germain, Francois and Wichern, Gordon and Le Roux, Jonathan},
      • title = {Hyperbolic Unsupervised Anomalous Sound Detection},
      • booktitle = {IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)},
      • year = 2023,
      • month = sep,
      • publisher = {IEEE},
      • doi = {10.1109/WASPAA58266.2023.10248092},
      • issn = {1947-1629},
      • isbn = {979-8-3503-2372-6},
      • url = {https://www.merl.com/publications/TR2023-108}
      • }
    •  Petermann, D., Wichern, G., Subramanian, A.S., Wang, Z.-Q., Le Roux, J., "Tackling the Cocktail Fork Problem for Separation and Transcription of Real-World Soundtracks", IEEE/ACM Transactions on Audio, Speech, and Language Processing, DOI: 10.1109/TASLP.2023.3290428, Vol. 31, pp. 2592-2605, September 2023.
      BibTeX TR2023-113 PDF
      • @article{Petermann2023sep,
      • author = {Petermann, Darius and Wichern, Gordon and Subramanian, Aswin Shanmugam and Wang, Zhong-Qiu and Le Roux, Jonathan},
      • title = {Tackling the Cocktail Fork Problem for Separation and Transcription of Real-World Soundtracks},
      • journal = {IEEE/ACM Transactions on Audio, Speech, and Language Processing},
      • year = 2023,
      • volume = 31,
      • pages = {2592--2605},
      • month = sep,
      • doi = {10.1109/TASLP.2023.3290428},
      • issn = {2329-9304},
      • url = {https://www.merl.com/publications/TR2023-113}
      • }
    •  Yoshino, K., Chen, Y.-N., Crook, P., Kottur, S., Li, J., Hedayatnia, B., Moon, S., Fei, Z., Li, Z., Zhang, J., Feng, Y., Zhou, J., Kim, S., Liu, Y., Jin, D., Papangelis, A., Gopalakrishnan, K., Hakkani-Tur, D., Damavandi, B., Geramifard, A., Hori, C., Shah, A., Zhang, C., Li, H., Sedoc, J., D’Haro, L.F., Banchs, R., Rudnicky, A., "Overview of the Tenth Dialog System Technology Challenge: DSTC10", IEEE/ACM Transactions on Audio, Speech, and Language Processing, DOI: 10.1109/TASLP.2023.3293030, pp. 1-14, August 2023.
      BibTeX TR2023-109 PDF
      • @article{Yoshino2023aug,
      • author = {Yoshino, Koichiro and Chen, Yun-Nung and Crook, Paul and Kottur, Satwik and Li, Jinchao and Hedayatnia, Behnam and Moon, Seungwhan and Fei, Zhengcong and Li, Zekang and Zhang, Jinchao and Feng, Yang and Zhou, Jie and Kim, Seokhwan and Liu, Yang and Jin, Di and Papangelis, Alexandros and Gopalakrishnan, Karthik and Hakkani-Tur, Dilek and Damavandi, Babak and Geramifard, Alborz and Hori, Chiori and Shah, Ankit and Zhang, Chen and Li, Haizhou and Sedoc, João and D’Haro, Luis F. and Banchs, Rafael and Rudnicky, Alexander},
      • title = {Overview of the Tenth Dialog System Technology Challenge: DSTC10},
      • journal = {IEEE/ACM Transactions on Audio, Speech, and Language Processing},
      • year = 2023,
      • pages = {1--14},
      • month = aug,
      • doi = {10.1109/TASLP.2023.3293030},
      • issn = {2329-9290},
      • url = {https://www.merl.com/publications/TR2023-109}
      • }
    •  Hori, C., Peng, P., Harwath, D., Liu, X., Ota, K., Jain, S., Corcodel, R., Jha, D.K., Romeres, D., Le Roux, J., "Style-transfer based Speech and Audio-visual Scene understanding for Robot Action Sequence Acquisition from Videos", Interspeech, DOI: 10.21437/Interspeech.2023-1983, August 2023, pp. 4663-4667.
      BibTeX TR2023-104 PDF
      • @inproceedings{Hori2023aug,
      • author = {Hori, Chiori and Peng, Puyuan and Harwath, David and Liu, Xinyu and Ota, Kei and Jain, Siddarth and Corcodel, Radu and Jha, Devesh K. and Romeres, Diego and Le Roux, Jonathan},
      • title = {Style-transfer based Speech and Audio-visual Scene understanding for Robot Action Sequence Acquisition from Videos},
      • booktitle = {Interspeech},
      • year = 2023,
      • pages = {4663--4667},
      • month = aug,
      • doi = {10.21437/Interspeech.2023-1983},
      • url = {https://www.merl.com/publications/TR2023-104}
      • }
    •  Wu, S.-L., Chang, X., Wichern, G., Jung, J.-W., Germain, F., Le Roux, J., Watanabe, S., "BEATs-based Audio Captioning Model with Instructor Embedding Supervision and ChatGPT Mix-up," Tech. Rep. TR2023-068, DCASE2023 Challenge, May 2023.
      BibTeX TR2023-068 PDF
      • @techreport{Wu2023may,
      • author = {Wu, Shih-Lun and Chang, Xuankai and Wichern, Gordon and Jung, Jee-weon and Germain, Francois and Le Roux, Jonathan and Watanabe, Shinji},
      • title = {BEATs-based Audio Captioning Model with Instructor Embedding Supervision and ChatGPT Mix-up},
      • institution = {DCASE2023 Challenge},
      • year = 2023,
      • month = may,
      • url = {https://www.merl.com/publications/TR2023-068}
      • }
    •  Zhang, J., Cherian, A., Liu, Y., Shabat, I.B., Rodriguez, C., Gould, S., "Aligning Step-by-Step Instructional Diagrams to Video Demonstrations", IEEE Conference on Computer Vision and Pattern Recognition (CVPR), May 2023, pp. 2483-2492.
      BibTeX TR2023-034 PDF
      • @inproceedings{Zhang2023may,
      • author = {Zhang, Jiahao and Cherian, Anoop and Liu, Yanbin and Shabat, Itzik Ben and Rodriguez, Cristian and Gould, Stephen},
      • title = {Aligning Step-by-Step Instructional Diagrams to Video Demonstrations},
      • booktitle = {IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
      • year = 2023,
      • pages = {2483--2492},
      • month = may,
      • publisher = {CVF},
      • url = {https://www.merl.com/publications/TR2023-034}
      • }
    •  Aralikatti, R., Boeddeker, C., Wichern, G., Subramanian, A.S., Le Roux, J., "Reverberation as Supervision for Speech Separation", IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), DOI: 10.1109/ICASSP49357.2023.10095022, May 2023, pp. 1-5.
      BibTeX TR2023-016 PDF
      • @inproceedings{Aralikatti2023may,
      • author = {Aralikatti, Rohith and Boeddeker, Christoph and Wichern, Gordon and Subramanian, Aswin Shanmugam and Le Roux, Jonathan},
      • title = {Reverberation as Supervision for Speech Separation},
      • booktitle = {IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)},
      • year = 2023,
      • pages = {1--5},
      • month = may,
      • publisher = {IEEE},
      • doi = {10.1109/ICASSP49357.2023.10095022},
      • url = {https://www.merl.com/publications/TR2023-016}
      • }
    See All Publications for Speech & Audio
  • Videos

  • Software & Data Downloads