All-in-One Transformer: Unifying Speech Recognition, Audio Tagging, and Event Detection


Automatic speech recognition (ASR), audio tagging (AT), and acoustic event detection (AED) are typically treated as separate problems, each tackled with a specialized system architecture. This contrasts with the human auditory system, which uses a single (binaural) pathway to process sound signals from different sources. In addition, an acoustic model trained to recognize both speech and sound events could leverage multi-task learning to alleviate data scarcity in individual tasks. In this work, an all-in-one (AIO) acoustic model based on the Transformer architecture is trained to solve the ASR, AT, and AED tasks simultaneously, with model parameters shared across all tasks. For the ASR and AED tasks, the Transformer model is combined with the connectionist temporal classification (CTC) objective to enforce a monotonic ordering and to utilize timing information. Our experiments demonstrate that the AIO Transformer outperforms all baseline systems of various recent DCASE challenge tasks and is suitable for the total transcription of an acoustic scene, i.e., for simultaneously transcribing speech and recognizing the acoustic events occurring in it.
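
To illustrate how a CTC output encodes both a monotonic label order and coarse timing, the following minimal sketch collapses a per-frame label sequence by merging repeated labels and dropping blanks, while recording the frame at which each label starts. It is not the decoding procedure used in this work: the function name `ctc_greedy_collapse`, the blank index `0`, and the example label indices are assumptions made purely for illustration.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank symbol (illustrative choice)


def ctc_greedy_collapse(frame_labels, blank=BLANK):
    """Merge consecutive repeated labels and drop blanks.

    Returns a list of (label, start_frame) pairs, so the output keeps
    the left-to-right (monotonic) order and a coarse onset time.
    """
    decoded = []
    frame = 0
    for label, run in groupby(frame_labels):
        run_length = len(list(run))
        if label != blank:
            decoded.append((label, frame))
        frame += run_length
    return decoded


# Hypothetical per-frame argmax labels, e.g. 3 = a speech token, 5 = an event.
frames = [0, 3, 3, 0, 0, 5, 5, 5, 0, 3]
decoded = ctc_greedy_collapse(frames)
print(decoded)
```

Because the collapse only merges adjacent frames, labels can never be reordered, which is the monotonicity property the CTC objective enforces. The recorded start frames show how frame-level outputs carry the timing information relevant to event detection.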