TR2024-006

TS-SEP: Joint Diarization and Separation Conditioned on Estimated Speaker Embeddings

- Boeddeker, C., Subramanian, A.S., Wichern, G., Haeb-Umbach, R., Le Roux, J., "TS-SEP: Joint Diarization and Separation Conditioned on Estimated Speaker Embeddings", IEEE/ACM Transactions on Audio, Speech, and Language Processing, DOI: 10.1109/TASLP.2024.3350887, Vol. 32, pp. 1185-1197, February 2024.
@article{Boeddeker2024feb,
author = {Boeddeker, Christoph and Subramanian, Aswin Shanmugam and Wichern, Gordon and Haeb-Umbach, Reinhold and Le Roux, Jonathan},
title = {TS-SEP: Joint Diarization and Separation Conditioned on Estimated Speaker Embeddings},
journal = {IEEE/ACM Transactions on Audio, Speech, and Language Processing},
year = 2024,
volume = 32,
pages = {1185--1197},
month = feb,
doi = {10.1109/TASLP.2024.3350887},
issn = {2329-9304},
url = {https://www.merl.com/publications/TR2024-006}
}
MERL Contacts:
- Gordon Wichern
- Jonathan Le Roux
Research Areas:

Artificial Intelligence, Machine Learning, Speech & Audio

Abstract:

Since diarization and source separation of meeting data are closely related tasks, we propose an approach that performs both jointly. It builds upon the target-speaker voice activity detection (TS-VAD) diarization approach, which assumes that initial speaker embeddings are available. We replace the final combined speaker activity estimation network of TS-VAD with a network that produces speaker activity estimates at a time-frequency resolution. These estimates act as masks for source extraction, either via masking or via beamforming. The technique can be applied to both single-channel and multi-channel input and, in both cases, achieves a new state-of-the-art word error rate (WER) on the LibriCSS meeting data recognition task. We further compute speaker-aware and speaker-agnostic WERs to isolate the contribution of diarization errors to the overall WER performance.
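
To make the masking route concrete, here is a minimal toy sketch of how per-speaker time-frequency activity estimates could be used as masks on a mixture spectrogram. It is not the authors' implementation. The function names `apply_masks` and `frame_activity`, the nested-list data layout, the toy values, and the mean-over-frequency thresholding rule are all assumptions made for illustration. The beamforming route mentioned in the abstract is not shown.

```python
# Toy illustration only: names, shapes, and values are hypothetical.

def apply_masks(mixture_stft, masks):
    """Extract one spectrogram per speaker by element-wise masking.

    mixture_stft: T frames, each a list of F complex frequency bins.
    masks: K per-speaker masks with values in [0, 1], each of shape (T, F).
    """
    return [
        [[m * x for m, x in zip(mask_frame, mix_frame)]
         for mask_frame, mix_frame in zip(speaker_mask, mixture_stft)]
        for speaker_mask in masks
    ]


def frame_activity(masks, threshold=0.5):
    """Reduce time-frequency masks to frame-level speaker activity.

    Assumed rule: a speaker counts as active in a frame when the mean
    mask value over frequency exceeds `threshold`.
    """
    return [
        [sum(frame) / len(frame) > threshold for frame in speaker_mask]
        for speaker_mask in masks
    ]


# Mixture with T=3 frames and F=2 frequency bins.
mixture = [[1 + 1j, 2 + 0j], [0 + 2j, 4 - 1j], [3 + 0j, 1 + 1j]]
masks = [
    [[1.0, 0.8], [0.5, 0.0], [0.0, 0.0]],  # speaker 0
    [[0.0, 0.2], [0.5, 1.0], [1.0, 1.0]],  # speaker 1
]

separated = apply_masks(mixture, masks)  # shape (K, T, F)
activity = frame_activity(masks)         # shape (K, T), booleans
```

In this sketch the same mask tensor yields both outputs: multiplying it with the mixture gives the separated signals, and pooling it over frequency gives a diarization-style activity track.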

Related Publication

Boeddeker, C., Subramanian, A.S., Wichern, G., Haeb-Umbach, R., Le Roux, J., "TS-SEP: Joint Diarization and Separation Conditioned on Estimated Speaker Embeddings", arXiv, March 2023.


@article{Boeddeker2023mar,
author = {Boeddeker, Christoph and Subramanian, Aswin Shanmugam and Wichern, Gordon and Haeb-Umbach, Reinhold and Le Roux, Jonathan},
title = {TS-SEP: Joint Diarization and Separation Conditioned on Estimated Speaker Embeddings},
journal = {arXiv},
year = 2023,
month = mar,
url = {https://arxiv.org/abs/2303.03849}
}
