TR2019-003

Teacher-Student Deep Clustering For Low-Delay Channel Speech Separation

- Aihara, R., Hanazawa, T., Okato, Y., Wichern, G., Le Roux, J., "Teacher-Student Deep Clustering For Low-Delay Channel Speech Separation", IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), DOI: 10.1109/ICASSP.2019.8682695, May 2019.
  BibTeX:
  @inproceedings{Aihara2019may,
    author = {Aihara, Ryo and Hanazawa, Toshiyuki and Okato, Yohei and Wichern, Gordon and {Le Roux}, Jonathan},
    title = {{Teacher-Student Deep Clustering For Low-Delay Channel Speech Separation}},
    booktitle = {IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)},
    year = 2019,
    month = may,
    doi = {10.1109/ICASSP.2019.8682695},
    url = {https://www.merl.com/publications/TR2019-003}
  }
MERL Contacts:
- Gordon Wichern
- Jonathan Le Roux
Research Areas:

Artificial Intelligence, Machine Learning, Speech & Audio

Abstract:

The recently proposed deep clustering algorithm introduced significant advances in monaural speaker-independent multi-speaker speech separation. Deep clustering operates on magnitude spectrograms using bidirectional recurrent networks and K-means clustering, both of which require offline operation, i.e., algorithm latency is longer than utterance length. This paper evaluates architectures for reduced-latency deep clustering by combining: (1) block processing to efficiently propagate the memory encoded by the recurrent network, and (2) teacher-student learning, where low-latency models learn from an offline teacher. Compared to our best-performing offline model, we lose only 0.3 dB SDR at a latency of 1.2 seconds and 0.7 dB SDR at a latency of 0.6 seconds on the publicly available wsj0-2mix dataset. Moreover, by providing a detailed analysis of the failure cases for our low-latency speech separation models, we show that the cause of this performance gap is related to frame-level permutation errors, where the network fails to accurately track speaker identity throughout an utterance.
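
As a rough illustration of the teacher-student idea, the sketch below computes an affinity-matching loss between the embeddings of a low-latency student and those of an offline teacher. It is a minimal sketch under stated assumptions, not the paper's training objective: the abstract does not specify the loss, so the form used here (the standard deep clustering affinity loss with the teacher's embeddings in place of the ideal labels) is an assumption, and the function name `affinity_loss` and the array shapes are hypothetical.

```python
import numpy as np


def affinity_loss(student_emb, target_emb):
    """Return ||V V^T - T T^T||_F^2 for V = student_emb, T = target_emb.

    Both inputs have shape (N, D): N time-frequency bins, D embedding
    dimensions. The expansion below avoids forming the N x N affinity
    matrices, using only D x D products.

    Assumption: using teacher embeddings as the target is one plausible
    teacher-student objective, not necessarily the one used in the paper.
    """
    vtv = student_emb.T @ student_emb  # (D, D)
    vtt = student_emb.T @ target_emb   # (D, D)
    ttt = target_emb.T @ target_emb    # (D, D)
    return float(np.sum(vtv ** 2) - 2.0 * np.sum(vtt ** 2) + np.sum(ttt ** 2))


# Hypothetical usage with random stand-ins for network outputs.
rng = np.random.default_rng(0)
teacher = rng.standard_normal((200, 8))                   # offline teacher embeddings
student = teacher + 0.1 * rng.standard_normal((200, 8))   # low-latency student embeddings
loss = affinity_loss(student, teacher)
```

The loss is zero when the student reproduces the teacher's pairwise affinities exactly, and it grows as the two sets of embeddings group time-frequency bins differently.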

Related News & Events

NEWS MERL presenting 16 papers at ICASSP 2019
Date: May 12, 2019 - May 17, 2019
Where: Brighton, UK
MERL Contacts: Petros T. Boufounos; Anoop Cherian; Chiori Hori; Toshiaki Koike-Akino; Jonathan Le Roux; Dehong Liu; Hassan Mansour; Tim K. Marks; Philip V. Orlik; Anthony Vetro; Pu (Perry) Wang; Gordon Wichern
Research Areas: Computational Sensing, Computer Vision, Machine Learning, Signal Processing, Speech & Audio
Brief
- MERL researchers will be presenting 16 papers at the IEEE International Conference on Acoustics, Speech & Signal Processing (ICASSP), which is being held in Brighton, UK from May 12-17, 2019. Topics to be presented include recent advances in speech recognition, audio processing, scene understanding, computational sensing, and parameter estimation. MERL is also a sponsor of the conference and will be participating in the student career luncheon; please join us at the lunch to learn about our internship program and career opportunities.
  
  ICASSP is the flagship conference of the IEEE Signal Processing Society, and the world's largest and most comprehensive technical conference focused on the research advances and latest technological development in signal and information processing. The event attracts more than 2000 participants each year.
