TR2020-139

Transformer-based Long-context End-to-end Speech Recognition


    •  Hori, T., Moritz, N., Hori, C., Le Roux, J., "Transformer-based Long-context End-to-end Speech Recognition", Annual Conference of the International Speech Communication Association (Interspeech), October 2020.
      BibTeX:

      @inproceedings{Hori2020oct,
        author = {Hori, Takaaki and Moritz, Niko and Hori, Chiori and Le Roux, Jonathan},
        title = {Transformer-based Long-context End-to-end Speech Recognition},
        booktitle = {Annual Conference of the International Speech Communication Association (Interspeech)},
        year = 2020,
        month = oct,
        url = {https://www.merl.com/publications/TR2020-139}
      }
  Research Areas: Artificial Intelligence, Machine Learning, Speech & Audio

This paper presents an approach to long-context end-to-end automatic speech recognition (ASR) using Transformers, aiming at improving ASR accuracy for long audio recordings such as lectures and conversational speech. Most end-to-end ASR systems are designed to recognize independent utterances, but contextual information (e.g., speaker or topic) spanning multiple utterances is known to be useful for ASR. There are some prior studies on RNN-based models that utilize such contextual information, but very few on Transformers, which are becoming more popular in end-to-end ASR. In this paper, we propose a Transformer-based architecture that accepts multiple consecutive utterances at the same time and predicts an output sequence for the last utterance. This is repeated in a sliding-window fashion with one-utterance shifts to recognize the entire recording. Based on this framework, we also investigate how to design the context window and train the model effectively in monologue (one speaker) and dialogue (two speakers) scenarios. We demonstrate the effectiveness of our approach using monologue benchmarks on CSJ and TED-LIUM3 and dialogue benchmarks on SWITCHBOARD and HKUST, showing significant error reduction from single-utterance ASR baselines with or without speaker i-vectors.
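The sliding-window decoding described in the abstract can be illustrated with a minimal sketch: each utterance in a recording is paired with its preceding utterances as context, and only the last utterance's transcript is predicted. The function name, the list-of-strings representation of utterances, and the fixed context size are illustrative assumptions, not details from the paper.

```python
def make_context_windows(utterances, context_size):
    """Pair each utterance with up to `context_size` preceding utterances.

    Illustrative sketch of the one-utterance-shift sliding window from the
    abstract: the model would consume the concatenated context plus the
    current utterance, but emit a transcript only for the current one.
    """
    windows = []
    for i, utt in enumerate(utterances):
        context = utterances[max(0, i - context_size):i]
        windows.append((context, utt))  # (context utterances, target utterance)
    return windows

# Example: a 4-utterance recording with a 2-utterance context window.
utts = ["u1", "u2", "u3", "u4"]
for ctx, tgt in make_context_windows(utts, context_size=2):
    print(ctx, "->", tgt)
```

Each window advances by exactly one utterance, so every utterance in the recording is eventually the prediction target while still benefiting from its recent context.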