TR2025-054

An End-to-End Integration of Speech Separation and Recognition with Self-Supervised Learning Representation


    •  Masuyama, Y., Chang, X., Zhang, W., Cornell, S., Wang, Z.-Q., Ono, N., Qian, Y., Watanabe, S., "An End-to-End Integration of Speech Separation and Recognition with Self-Supervised Learning Representation", Computer Speech & Language, May 2025.
      @article{Masuyama2025may,
        author = {Masuyama, Yoshiki and Chang, Xuankai and Zhang, Wangyou and Cornell, Samuele and Wang, Zhong-Qiu and Ono, Nobutaka and Qian, Yanmin and Watanabe, Shinji},
        title = {{An End-to-End Integration of Speech Separation and Recognition with Self-Supervised Learning Representation}},
        journal = {Computer Speech \& Language},
        year = 2025,
        month = may,
        url = {https://www.merl.com/publications/TR2025-054}
      }
  • Research Areas: Artificial Intelligence, Machine Learning, Speech & Audio

Abstract:

Multi-speaker automatic speech recognition (ASR) has attracted growing attention in a wide range of applications, including conversation analysis and human-computer interaction. Speech separation and enhancement (SSE) and single-speaker ASR have seen remarkable performance improvements with the rapid advances in deep learning. Complex spectral mapping predicts the short-time Fourier transform (STFT) coefficients of each speaker and has achieved promising results on several SSE benchmarks. Meanwhile, self-supervised learning representation (SSLR) has demonstrated significant advantages in single-speaker ASR. In this work, we push forward the performance of multi-speaker ASR under noisy reverberant conditions by integrating powerful SSE, SSL, and ASR models in an end-to-end manner. We systematically investigate both monaural and multi-channel SSE methods and various feature representations. Our experiments demonstrate the advantages of recently proposed complex spectral mapping and SSLRs in multi-speaker ASR. The experimental results also confirm that end-to-end fine-tuning with an ASR criterion is important for achieving state-of-the-art word error rates (WERs), even with powerful pre-trained models. Moreover, we show the performance trade-off between SSE and ASR and mitigate it with a multi-task learning framework using both SSE and ASR criteria.
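The multi-task framework mentioned above combines an SSE criterion with an ASR criterion in a single training objective. A minimal sketch of such an interpolated loss is shown below; the use of negative SI-SNR as the separation loss and the value of the interpolation weight are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def sse_loss(est, ref, eps=1e-8):
    """Negative scale-invariant SNR, a common SSE training criterion
    (an illustrative stand-in for the paper's separation loss)."""
    # Project the estimate onto the reference to remove scale ambiguity.
    alpha = np.dot(ref, est) / (np.dot(ref, ref) + eps)
    target = alpha * ref
    noise = est - target
    # Lower is better: a perfect estimate drives the noise term to zero.
    return -10.0 * np.log10(np.sum(target**2) / (np.sum(noise**2) + eps) + eps)

def multitask_loss(l_asr, l_sse, weight=0.3):
    """Interpolate ASR and SSE criteria; `weight` is a hypothetical
    hyperparameter balancing the two objectives."""
    return (1.0 - weight) * l_asr + weight * l_sse
```

In practice the ASR term would be a sequence loss (e.g. CTC or attention-based cross-entropy) computed on the recognizer's output, and the combined loss is backpropagated through both the recognizer and the separator, which is what makes the integration end-to-end.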