TR2020-040

Streaming Automatic Speech Recognition With The Transformer Model

- Moritz, N., Hori, T., Le Roux, J., "Streaming Automatic Speech Recognition With The Transformer Model", IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), DOI: 10.1109/ICASSP40776.2020.9054476, April 2020, pp. 6074-6078.
  @inproceedings{Moritz2020apr,
  author = {Moritz, Niko and Hori, Takaaki and {Le Roux}, Jonathan},
  title = {{Streaming Automatic Speech Recognition With The Transformer Model}},
  booktitle = {IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)},
  year = 2020,
  pages = {6074--6078},
  month = apr,
  publisher = {IEEE},
  doi = {10.1109/ICASSP40776.2020.9054476},
  issn = {2379-190X},
  isbn = {978-1-5090-6631-5},
  url = {https://www.merl.com/publications/TR2020-040}
  }
MERL Contact:
- Jonathan Le Roux
Research Areas:

Artificial Intelligence, Machine Learning, Speech & Audio

Abstract:

Encoder-decoder based sequence-to-sequence models have demonstrated state-of-the-art results in end-to-end automatic speech recognition (ASR). Recently, the transformer architecture, which uses self-attention to model temporal context information, has been shown to achieve significantly lower word error rates (WERs) compared to recurrent neural network (RNN) based system architectures. Despite its success, the practical usage is limited to offline ASR tasks, since encoder-decoder architectures typically require an entire speech utterance as input. In this work, we propose a transformer based end-to-end ASR system for streaming ASR, where an output must be generated shortly after each spoken word. To achieve this, we apply time-restricted self-attention for the encoder and triggered attention for the encoder-decoder attention mechanism. Our proposed streaming transformer architecture achieves 2.8% and 7.2% WER for the “clean” and “other” test data of LibriSpeech, which to our knowledge is the best published streaming end-to-end ASR result for this task.
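
The time-restricted self-attention mentioned in the abstract can be illustrated with a small attention mask. The following is a minimal sketch, not the authors' implementation: the function names, the `left_context` and `right_context` parameters, and the frame counts are all assumptions chosen for illustration. It builds a mask in which each encoder frame may attend only to a bounded window of past and future frames, and it shows how look-ahead would accumulate when such layers are stacked.

```python
def time_restricted_mask(num_frames, left_context, right_context):
    """Build a boolean self-attention mask (hypothetical helper).

    mask[i][j] is True when frame i is allowed to attend to frame j,
    i.e. when i - left_context <= j <= i + right_context.
    """
    mask = []
    for i in range(num_frames):
        lo = max(0, i - left_context)
        hi = min(num_frames - 1, i + right_context)
        mask.append([lo <= j <= hi for j in range(num_frames)])
    return mask


def accumulated_lookahead(num_layers, right_context):
    """Total future frames visible after stacking restricted layers.

    Each layer can look right_context frames ahead of its input, so the
    receptive field grows linearly with depth (illustrative assumption
    that every layer uses the same right_context).
    """
    return num_layers * right_context


if __name__ == "__main__":
    # Illustrative values only: 5 frames, 2 past frames, 1 future frame.
    for row in time_restricted_mask(5, left_context=2, right_context=1):
        print("".join("x" if allowed else "." for allowed in row))
    print(accumulated_lookahead(num_layers=12, right_context=1))
```

In a streaming setting, a smaller `right_context` would reduce the delay before the encoder can emit a frame, at the cost of less future context per layer.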

Related News & Events

NEWS MERL presenting 13 papers and an industry talk at ICASSP 2020
Date: May 4, 2020 - May 8, 2020
Where: Virtual Barcelona
MERL Contacts: Petros T. Boufounos; Chiori Hori; Toshiaki Koike-Akino; Jonathan Le Roux; Dehong Liu; Yanting Ma; Hassan Mansour; Philip V. Orlik; Anthony Vetro; Pu (Perry) Wang; Gordon Wichern
Research Areas: Computational Sensing, Computer Vision, Machine Learning, Signal Processing, Speech & Audio
Brief
- MERL researchers are presenting 13 papers at the IEEE International Conference on Acoustics, Speech & Signal Processing (ICASSP), which is being held virtually from May 4-8, 2020. Petros Boufounos is also presenting a talk on the Computational Sensing Revolution in Array Processing in ICASSP’s Industry Track, and Siheng Chen is co-organizing and chairing a special session on a Signal-Processing View of Graph Neural Networks.
  
  Topics to be presented include recent advances in speech recognition, audio processing, scene understanding, computational sensing, array processing, and parameter estimation. Videos for all talks are available on MERL's YouTube channel, with corresponding links in the references below.
  
  This year again, MERL is a sponsor of the conference and will be participating in the Student Job Fair; please join us to learn about our internship program and career opportunities.
  
  ICASSP is the flagship conference of the IEEE Signal Processing Society, and the world's largest and most comprehensive technical conference focused on the research advances and latest technological development in signal and information processing. The event attracts more than 2000 participants each year. Originally planned to be held in Barcelona, Spain, ICASSP has moved to a fully virtual setting due to the COVID-19 crisis, with free registration for participants not covering a paper.

Related Publication

Moritz, N., Hori, T., Le Roux, J., "Streaming automatic speech recognition with the transformer model", arXiv, January 2020.


@article{Moritz2020jan,
author = {Moritz, Niko and Hori, Takaaki and {Le Roux}, Jonathan},
title = {{Streaming automatic speech recognition with the transformer model}},
journal = {arXiv},
year = 2020,
month = jan,
url = {https://arxiv.org/abs/2001.02674}
}
