Advanced Long-context End-to-end Speech Recognition Using Context-expanded Transformers


This paper addresses end-to-end automatic speech recognition (ASR) for long audio recordings such as lectures and conversational speech. Most end-to-end ASR systems are designed to recognize independent utterances, but contextual information (e.g., speaker or topic) that spans multiple utterances is known to be useful for ASR. In our prior work, we proposed a context-expanded Transformer that accepts multiple consecutive utterances simultaneously and predicts an output sequence for the last utterance, and showed that the model achieves a 5-15% relative error reduction over utterance-based baselines on lecture and conversational ASR benchmarks. Although these results demonstrated the efficacy of the proposed Transformer, there is still room to improve the model architecture, and there are remaining issues with the decoding strategy. In this paper, we extend our approach by introducing (1) the Conformer architecture to further improve accuracy, (2) accelerated decoding with activation recycling, and (3) streaming decoding based on triggered attention. We demonstrate that the extended Transformer achieves state-of-the-art end-to-end ASR accuracy on the HKUST and SWITCHBOARD benchmarks, and that the new decoding method reduces decoding time by more than 50% while also enabling streaming ASR.
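The core idea of the context-expanded Transformer is that the encoder input for each utterance is extended with the preceding utterances, while the decoder target covers only the current (last) utterance. The following is a minimal, hypothetical sketch of how such input/target pairs could be assembled; the function and parameter names (`make_context_expanded_batches`, `num_context`) are illustrative, not the authors' implementation.

```python
# Hypothetical sketch (not the authors' code): build context-expanded
# training pairs where the encoder sees previous utterances as left
# context, but the loss/target covers only the current utterance.

def make_context_expanded_batches(utterance_features, utterance_tokens,
                                  num_context=2):
    """For each utterance i, concatenate up to `num_context` preceding
    utterances' acoustic features in time as left context; the target
    token sequence is that of utterance i only."""
    batches = []
    for i, tokens in enumerate(utterance_tokens):
        start = max(0, i - num_context)
        # Encoder input: context utterances + current utterance, concatenated.
        expanded = [frame
                    for feats in utterance_features[start:i + 1]
                    for frame in feats]
        # Decoder target: tokens of the current utterance only.
        batches.append((expanded, tokens))
    return batches

# Toy example with dummy per-utterance frame features and token ids.
feats = [["a1", "a2"], ["b1"], ["c1", "c2"]]
toks = [[1], [2], [3]]
batches = make_context_expanded_batches(feats, toks, num_context=1)
```

With `num_context=1`, the third pair's encoder input is `["b1", "c1", "c2"]` (one utterance of context plus the current one) while its target is `[3]`, matching the "predict only the last utterance" formulation described above.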


  • Related Publication

  •  Hori, T., Moritz, N., Hori, C., Le Roux, J., "Advanced Long-context End-to-end Speech Recognition Using Context-expanded Transformers", arXiv, April 2021.
    @article{Hori2021apr,
      author = {Hori, Takaaki and Moritz, Niko and Hori, Chiori and Le Roux, Jonathan},
      title = {Advanced Long-context End-to-end Speech Recognition Using Context-expanded Transformers},
      journal = {arXiv},
      year = 2021,
      month = apr,
      url = {}
    }