Transformer-based Long-context End-to-end Speech Recognition

This paper presents an approach to long-context end-to-end automatic speech recognition (ASR) using Transformers, aiming to improve ASR accuracy for long audio recordings such as lectures and conversational speech. Most end-to-end ASR systems are designed to recognize independent utterances, but contextual information (e.g., speaker or topic) spanning multiple utterances is known to be useful for ASR. Several prior studies have exploited such contextual information with RNN-based models, but very few have done so with Transformers, which are becoming increasingly popular in end-to-end ASR. In this paper, we propose a Transformer-based architecture that accepts multiple consecutive utterances at the same time and predicts an output sequence for the last utterance only. This is repeated in a sliding-window fashion with one-utterance shifts to recognize the entire recording. Based on this framework, we also investigate how to design the context window and train the model effectively in monologue (one speaker) and dialogue (two speakers) scenarios. We demonstrate the effectiveness of our approach using monologue benchmarks on CSJ and TED-LIUM3 and dialogue benchmarks on SWITCHBOARD and HKUST, showing significant error reductions relative to single-utterance ASR baselines with or without speaker i-vectors.
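
The sliding-window procedure described above can be illustrated with a minimal sketch. It only shows how a recording might be split into (context, target) pairs, not the Transformer itself. The function name `sliding_context_windows` and the parameter `window_size` are hypothetical, and the handling of the first utterances (a shorter context at the start of the recording) is an assumption, since the abstract does not specify it.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def sliding_context_windows(
    utterances: Sequence[T], window_size: int
) -> List[Tuple[List[T], T]]:
    """Pair each utterance with the utterances that precede it in the window.

    window_size is a hypothetical parameter: the total number of consecutive
    utterances fed to the model at once, including the target utterance.
    """
    if window_size < 1:
        raise ValueError("window_size must be at least 1")

    windows = []
    for i, target in enumerate(utterances):
        # Assumption: near the start of the recording the context is simply
        # shorter, because fewer preceding utterances exist.
        start = max(0, i - window_size + 1)
        context = list(utterances[start:i])
        # The model would read context + [target] and emit text for target only.
        windows.append((context, target))
    return windows


if __name__ == "__main__":
    recording = ["utt1", "utt2", "utt3", "utt4"]
    for context, target in sliding_context_windows(recording, window_size=3):
        print(context, "->", target)
```

Each step shifts the window by one utterance, so every utterance in the recording is the target exactly once. Setting `window_size=1` leaves every context empty, which corresponds to recognizing each utterance independently.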