TR2021-069

Transcription Is All You Need: Learning to Separate Musical Mixtures with Score as Supervision

- Hung, Y.-N., Wichern, G., Le Roux, J., "Transcription Is All You Need: Learning to Separate Musical Mixtures with Score as Supervision", IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), DOI: 10.1109/ICASSP39728.2021.9413358, June 2021, pp. 46-50.
  @inproceedings{Hung2021jun,
    author = {Hung, Yun-Ning and Wichern, Gordon and {Le Roux}, Jonathan},
    title = {{Transcription Is All You Need: Learning to Separate Musical Mixtures with Score as Supervision}},
    booktitle = {IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)},
    year = 2021,
    pages = {46--50},
    month = jun,
    doi = {10.1109/ICASSP39728.2021.9413358},
    issn = {2379-190X},
    isbn = {978-1-7281-7605-5},
    url = {https://www.merl.com/publications/TR2021-069}
  }
MERL Contacts:
- Gordon Wichern
- Jonathan Le Roux
Research Areas:

Artificial Intelligence, Machine Learning, Speech & Audio

Abstract:

Most music source separation systems require large collections of isolated sources for training, which can be difficult to obtain. In this work, we use musical scores, which are comparatively easy to obtain, as a weak label for training a source separation system. In contrast with previous score-informed separation approaches, our system does not require isolated sources, and score is used only as a training target, not required for inference. Our model consists of a separator that outputs a time-frequency mask for each instrument, and a transcriptor that acts as a critic, providing both temporal and frequency supervision to guide the learning of the separator. A harmonic mask constraint is introduced as another way of leveraging score information during training, and we propose two novel adversarial losses for additional fine-tuning of both the transcriptor and the separator. Results demonstrate that using score information outperforms temporal weak-labels, and adversarial structures lead to further improvements in both separation and transcription performance.
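As a rough illustration of the harmonic mask idea mentioned in the abstract, the sketch below shows one plausible way a score could be turned into a time-frequency constraint. It is not the authors' implementation: the function names (`midi_to_hz`, `harmonic_mask`, `apply_constraint`), the piano-roll input format, the STFT parameters, the number of harmonics, and the bin width are all assumptions made for this example.

```python
import numpy as np

def midi_to_hz(pitch):
    """Convert a MIDI note number to Hz (A4 = MIDI 69 = 440 Hz)."""
    return 440.0 * 2.0 ** ((pitch - 69) / 12.0)

def harmonic_mask(piano_roll, sample_rate=16000, n_fft=1024,
                  n_harmonics=10, width_bins=1):
    """Build a binary harmonic mask from a score (hypothetical helper).

    piano_roll: (128, T) binary array, 1 where a MIDI pitch is active.
    Returns an (n_fft // 2 + 1, T) array that is 1 near the harmonics
    of every active note and 0 elsewhere.
    """
    n_bins = n_fft // 2 + 1
    n_frames = piano_roll.shape[1]
    mask = np.zeros((n_bins, n_frames), dtype=np.float32)
    hz_per_bin = sample_rate / n_fft
    # Visit every (pitch, frame) pair that is active in the score.
    for pitch, frame in zip(*np.nonzero(piano_roll)):
        f0 = midi_to_hz(pitch)
        for k in range(1, n_harmonics + 1):
            center = int(round(k * f0 / hz_per_bin))
            if center >= n_bins:
                break  # harmonic lies above the Nyquist bin
            lo = max(center - width_bins, 0)
            hi = min(center + width_bins, n_bins - 1)
            mask[lo:hi + 1, frame] = 1.0
    return mask

def apply_constraint(separator_mask, harm_mask):
    """Zero out separator mask values outside the score's harmonics."""
    return separator_mask * harm_mask

# Example: a single A4 note active in frame 1 of a 4-frame excerpt.
roll = np.zeros((128, 4), dtype=np.int64)
roll[69, 1] = 1
harm = harmonic_mask(roll)
constrained = apply_constraint(np.ones_like(harm), harm)
```

In this sketch the mask is needed only while training, which is consistent with the abstract's statement that the score is not required for inference.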

Related News & Events

NEWS Jonathan Le Roux gives invited talk at CMU's Language Technologies Institute Colloquium
Date: December 9, 2022
Where: Pittsburgh, PA
MERL Contact: Jonathan Le Roux
Research Areas: Artificial Intelligence, Machine Learning, Speech & Audio
Brief
- MERL Senior Principal Research Scientist and Speech and Audio Senior Team Leader Jonathan Le Roux was invited by Carnegie Mellon University's Language Technologies Institute (LTI) to give a talk as part of the LTI Colloquium Series. The LTI Colloquium is a prestigious series of talks given by experts from across the country on different areas of language technologies. Jonathan's talk, entitled "Towards general and flexible audio source separation", presented an overview of techniques developed at MERL toward the goal of robustly and flexibly decomposing and analyzing an acoustic scene. It described in particular the Speech and Audio Team's efforts to extend MERL's early speech separation and enhancement methods to more challenging environments and to more general and less supervised scenarios.

Related Publication

Hung, Y.-N., Wichern, G., Le Roux, J., "Transcription is All You Need: Learning to Separate Musical Mixtures with Score as Supervision", arXiv, November 2020.


@article{Hung2020nov,
author = {Hung, Yun-Ning and Wichern, Gordon and {Le Roux}, Jonathan},
title = {{Transcription is All You Need: Learning to Separate Musical Mixtures with Score as Supervision}},
journal = {arXiv},
year = 2020,
month = nov,
url = {https://arxiv.org/abs/2010.11904}
}
