TR2020-138

All-in-One Transformer: Unifying Speech Recognition, Audio Tagging, and Event Detection


    •  Moritz, N., Wichern, G., Hori, T., Le Roux, J., "All-in-One Transformer: Unifying Speech Recognition, Audio Tagging, and Event Detection", Annual Conference of the International Speech Communication Association (Interspeech), October 2020.
      BibTeX:
      @inproceedings{Moritz2020oct,
        author = {Moritz, Niko and Wichern, Gordon and Hori, Takaaki and Le Roux, Jonathan},
        title = {All-in-One Transformer: Unifying Speech Recognition, Audio Tagging, and Event Detection},
        booktitle = {Annual Conference of the International Speech Communication Association (Interspeech)},
        year = 2020,
        month = oct,
        url = {https://www.merl.com/publications/TR2020-138}
      }
  • Research Areas: Artificial Intelligence, Machine Learning, Speech & Audio

Automatic speech recognition (ASR), audio tagging (AT), and acoustic event detection (AED) are typically treated as separate problems, each tackled with a specialized system architecture. This contrasts with the human auditory system, which uses a single (binaural) pathway to process sound signals from different sources. In addition, an acoustic model trained to recognize both speech and sound events could leverage multi-task learning to alleviate data scarcity in individual tasks. In this work, an all-in-one (AIO) acoustic model based on the Transformer architecture is trained to solve the ASR, AT, and AED tasks simultaneously, with model parameters shared across all tasks. For the ASR and AED tasks, the Transformer model is combined with the connectionist temporal classification (CTC) objective to enforce a monotonic ordering and to utilize timing information. Our experiments demonstrate that the AIO Transformer outperforms all baseline systems of various recent DCASE challenge tasks and is suitable for the total transcription of an acoustic scene, i.e., simultaneously transcribing speech and recognizing the acoustic events occurring in it.
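
To make the role of the CTC objective more concrete, the following is a minimal sketch of the standard CTC forward recursion in plain Python. It is not the authors' implementation: the function names `ctc_likelihood` and `ctc_loss`, the choice of index 0 for the blank label, and the linear-domain arithmetic are all assumptions made for readability (practical systems typically work in the log domain for numerical stability). The recursion only ever stays at the current label position or moves forward, which is how CTC enforces a monotonic ordering between input frames and output labels.

```python
import math


def ctc_likelihood(probs, target, blank=0):
    """Sum the probability of all frame-level alignments that collapse to `target`.

    probs:  list of T frames, each a list of per-label posteriors
    target: list of label indices (no blanks)
    blank:  index of the blank label (assumed to be 0 here)
    """
    # Interleave blanks around the labels: [a, b] -> [blank, a, blank, b, blank]
    ext = [blank]
    for label in target:
        ext += [label, blank]
    S = len(ext)

    # alpha[s]: total probability of partial alignments that end at ext[s]
    alpha = [0.0] * S
    alpha[0] = probs[0][ext[0]]
    if S > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, len(probs)):
        new = [0.0] * S
        for s in range(S):
            total = alpha[s]                  # stay at the same position
            if s >= 1:
                total += alpha[s - 1]         # advance by one position
            # Skip a blank, allowed only between two different labels
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new[s] = total * probs[t][ext[s]]
        alpha = new

    # Complete alignments end in the last label or in the trailing blank
    return alpha[-1] + (alpha[-2] if S > 1 else 0.0)


def ctc_loss(probs, target, blank=0):
    """Negative log-likelihood of `target` under the CTC model."""
    return -math.log(ctc_likelihood(probs, target, blank))
```

As a small check, with two frames and uniform posteriors over {blank, a}, the target "a" is reached by three alignments ("a-", "-a", "aa"), each with probability 0.25, so `ctc_likelihood([[0.5, 0.5], [0.5, 0.5]], [1])` evaluates to 0.75. In a multi-task setting such as the one described above, a loss of this kind would be one term among several computed on top of shared model parameters; how the terms are weighted is not specified here.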