Learning to Separate Sounds From Weakly Labeled Scenes


Deep learning models for monaural audio source separation are typically trained on large collections of isolated sources, which may not be available in domains such as environmental monitoring. We propose objective functions and network architectures that enable training a source separation system with weak labels. In contrast with strong time-frequency (TF) labels, weak labels only indicate the time periods where different sources are active in this scenario. We train a separator that outputs a TF mask for each type of sound event, using a classifier to pool label estimates across frequency. Our objective function requires the classifier applied to a separated source to output weak labels for the class corresponding to that source and zeros for all other classes. The objective function also enforces that the separated sources sum to the mixture. We benchmark performance using synthetic mixtures of overlapping sound events recorded in urban environments. Compared to training on mixtures and their isolated sources, our model still achieves significant SDR improvement.


  • Related News & Events

  • Related Video