TR2020-037

Unsupervised Speaker Adaptation Using Attention-Based Speaker Memory For End-To-End ASR

- Sari, L., Moritz, N., Hori, T., Le Roux, J., "Unsupervised Speaker Adaptation Using Attention-Based Speaker Memory For End-To-End ASR", IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), DOI: 10.1109/ICASSP40776.2020.9054249, April 2020, pp. 7384-7388.
@inproceedings{Sari2020apr,
author = {Sari, Leda and Moritz, Niko and Hori, Takaaki and {Le Roux}, Jonathan},
title = {{Unsupervised Speaker Adaptation Using Attention-Based Speaker Memory For End-To-End ASR}},
booktitle = {IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)},
year = 2020,
pages = {7384--7388},
month = apr,
publisher = {IEEE},
doi = {10.1109/ICASSP40776.2020.9054249},
issn = {2379-190X},
isbn = {978-1-5090-6631-5},
url = {https://www.merl.com/publications/TR2020-037}
}
MERL Contact: Jonathan Le Roux
Research Areas:

Artificial Intelligence, Machine Learning, Speech & Audio

Abstract:

We propose an unsupervised speaker adaptation method inspired by the neural Turing machine for end-to-end (E2E) automatic speech recognition (ASR). The proposed model contains a memory block that holds speaker i-vectors extracted from the training data and reads relevant i-vectors from the memory through an attention mechanism. The resulting memory vector (M-vector) is concatenated to the acoustic features or to the hidden layer activations of an E2E neural network model. The E2E ASR system is based on the joint connectionist temporal classification and attention-based encoder-decoder architecture. We compare M-vectors and i-vectors when they are inserted at different layers of the encoder neural network, using the WSJ and TED-LIUM2 ASR benchmarks. We show that M-vectors, which do not require an auxiliary speaker embedding extraction system at test time, achieve word error rates (WERs) similar to those of i-vectors for single-speaker utterances and significantly lower WERs for utterances in which there are speaker changes.
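
The memory read described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: dot-product scoring with a softmax, the single `query` vector, and the function names `read_memory` and `append_m_vector` are all assumptions made for illustration, and the toy numbers are made up. The sketch only shows the general idea of attending over stored i-vectors to form an M-vector and concatenating it to feature frames.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def read_memory(query, memory):
    """Attention read over stored i-vectors (hypothetical helper).

    Each memory row is one training-speaker i-vector. Dot-product
    scoring is an assumption; the abstract does not specify the scorer.
    """
    scores = [sum(q * m for q, m in zip(query, ivec)) for ivec in memory]
    weights = softmax(scores)
    dim = len(memory[0])
    # The M-vector is the attention-weighted sum of the memory rows.
    m_vector = [
        sum(w * ivec[d] for w, ivec in zip(weights, memory))
        for d in range(dim)
    ]
    return m_vector, weights


def append_m_vector(frames, m_vector):
    """Concatenate the M-vector to every feature frame (hypothetical helper)."""
    return [frame + m_vector for frame in frames]


# Toy data: three 2-dimensional "i-vectors" and two 3-dimensional frames.
memory = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
query = [2.0, 0.0]  # assumed to be derived from the input by the network
frames = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]

m_vector, weights = read_memory(query, memory)
augmented = append_m_vector(frames, m_vector)
```

Because the read is a weighted combination of i-vectors already held in memory, nothing in this sketch calls an external speaker embedding extractor at inference, which mirrors the property the abstract highlights for M-vectors.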

Related News & Events

NEWS MERL presenting 13 papers and an industry talk at ICASSP 2020
Date: May 4, 2020 - May 8, 2020
Where: Virtual Barcelona
MERL Contacts: Petros T. Boufounos; Chiori Hori; Toshiaki Koike-Akino; Jonathan Le Roux; Dehong Liu; Yanting Ma; Hassan Mansour; Philip V. Orlik; Anthony Vetro; Pu (Perry) Wang; Gordon Wichern
Research Areas: Computational Sensing, Computer Vision, Machine Learning, Signal Processing, Speech & Audio
Brief
- MERL researchers are presenting 13 papers at the IEEE International Conference on Acoustics, Speech & Signal Processing (ICASSP), which is being held virtually from May 4-8, 2020. Petros Boufounos is also presenting a talk on the Computational Sensing Revolution in Array Processing (video) in ICASSP’s Industry Track, and Siheng Chen is co-organizing and chairing a special session on a Signal-Processing View of Graph Neural Networks.
  
  Topics to be presented include recent advances in speech recognition, audio processing, scene understanding, computational sensing, array processing, and parameter estimation. Videos for all talks are available on MERL's YouTube channel, with corresponding links in the references below.
  
  This year again, MERL is a sponsor of the conference and will be participating in the Student Job Fair; please join us to learn about our internship program and career opportunities.
  
  ICASSP is the flagship conference of the IEEE Signal Processing Society, and the world's largest and most comprehensive technical conference focused on the research advances and latest technological development in signal and information processing. The event attracts more than 2000 participants each year. Originally planned to be held in Barcelona, Spain, ICASSP has moved to a fully virtual setting due to the COVID-19 crisis, with free registration for participants not covering a paper.

Related Publication

Sari, L., Moritz, N., Hori, T., Le Roux, J., "Unsupervised Speaker Adaptation Using Attention-Based Speaker Memory for End-to-End ASR", arXiv, February 2020.

@article{Sari2020feb,
author = {Sari, Leda and Moritz, Niko and Hori, Takaaki and {Le Roux}, Jonathan},
title = {{Unsupervised Speaker Adaptation Using Attention-Based Speaker Memory for End-to-End ASR}},
journal = {arXiv},
year = 2020,
month = feb,
url = {https://arxiv.org/abs/2002.06165}
}
