Transcription Is All You Need: Learning to Separate Musical Mixtures with Score as Supervision


Most music source separation systems require large collections of isolated sources for training, which can be difficult to obtain. In this work, we use musical scores, which are comparatively easy to obtain, as a weak label for training a source separation system. In contrast with previous score-informed separation approaches, our system does not require isolated sources, and the score is used only as a training target; it is not required at inference time. Our model consists of a separator that outputs a time-frequency mask for each instrument, and a transcriptor that acts as a critic, providing both temporal and frequency supervision to guide the learning of the separator. A harmonic mask constraint is introduced as another way of leveraging score information during training, and we propose two novel adversarial losses for additional fine-tuning of both the transcriptor and the separator. Results demonstrate that using score information outperforms temporal weak labels, and that adversarial structures lead to further improvements in both separation and transcription performance.
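
To make the harmonic mask idea more concrete, the following is a minimal sketch of how a score could be turned into a time-frequency constraint on a separator's mask. It is not the authors' implementation: it assumes the score is already aligned to the audio as a binary piano roll of shape (128, n_frames), that the spectrogram has linearly spaced frequency bins up to the Nyquist frequency, and the names `harmonic_mask` and `constrain_mask`, the number of harmonics, and the bin width are all hypothetical choices made for illustration.

```python
import numpy as np


def midi_to_hz(pitch):
    """Convert a MIDI note number to its fundamental frequency in Hz."""
    return 440.0 * 2.0 ** ((pitch - 69) / 12.0)


def harmonic_mask(piano_roll, n_bins, sample_rate, n_harmonics=8, width=1):
    """Build a binary (n_bins, n_frames) mask from a (128, n_frames) piano roll.

    A bin is set to 1 in a frame when it lies within `width` bins of one of
    the first `n_harmonics` harmonics of a note active in that frame.
    All parameter defaults are illustrative assumptions.
    """
    _, n_frames = piano_roll.shape
    mask = np.zeros((n_bins, n_frames), dtype=np.float32)
    # Linear frequency axis from 0 Hz to Nyquist (assumed STFT layout).
    hz_per_bin = (sample_rate / 2.0) / (n_bins - 1)
    for pitch, frame in zip(*np.nonzero(piano_roll)):
        f0 = midi_to_hz(pitch)
        for h in range(1, n_harmonics + 1):
            center = int(round(h * f0 / hz_per_bin))
            if center >= n_bins:
                break  # higher harmonics fall above Nyquist
            lo = max(center - width, 0)
            hi = min(center + width + 1, n_bins)
            mask[lo:hi, frame] = 1.0
    return mask


def constrain_mask(separator_mask, harm_mask):
    """Zero out separator mask values outside the score's harmonic regions."""
    return separator_mask * harm_mask
```

In this sketch the constraint is a simple element-wise product, so energy assigned to an instrument outside the harmonics of its scored notes is suppressed; how the constraint enters the training loss in the actual system is described in the paper itself.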


  • Related News & Events

    • NEWS: Jonathan Le Roux gives invited talk at CMU's Language Technologies Institute Colloquium
      Date: December 9, 2022
      Where: Pittsburgh, PA
      MERL Contact: Jonathan Le Roux
      Research Areas: Artificial Intelligence, Machine Learning, Speech & Audio
      • MERL Senior Principal Research Scientist and Speech and Audio Senior Team Leader Jonathan Le Roux was invited by Carnegie Mellon University's Language Technologies Institute (LTI) to speak in the LTI Colloquium Series, a prestigious series of talks on language technologies given by experts from across the country. His talk, entitled "Towards general and flexible audio source separation", gave an overview of techniques developed at MERL for robustly and flexibly decomposing and analyzing an acoustic scene. It described in particular the Speech and Audio Team's efforts to extend MERL's early speech separation and enhancement methods to more challenging environments and to more general and less supervised scenarios.
  • Related Publication

  •  Hung, Y.-N., Wichern, G., Le Roux, J., "Transcription is All You Need: Learning to Separate Musical Mixtures with Score as Supervision", arXiv, November 2020.
    @article{Hung2020nov,
      author = {Hung, Yun-Ning and Wichern, Gordon and Le Roux, Jonathan},
      title = {Transcription is All You Need: Learning to Separate Musical Mixtures with Score as Supervision},
      journal = {arXiv},
      year = 2020,
      month = nov,
      url = {}
    }