On the Use of Pretrained Deep Audio Encoders for Automated Audio Captioning Tasks
Automated audio captioning (AAC) is the task of describing an audio clip that may contain sounds from natural and/or human activities. In the context of autonomous driving, AAC models are a beneficial addition, as they can enhance a self-driving car's awareness of the surrounding acoustic environment. Although AAC is a relatively new task, recurring DCASE challenges have sparked continuous research interest in it. Following the tremendous success of pretrained deep models in fields such as computer vision and natural language processing, the majority of DCASE AAC challenge submissions have likewise adopted a pretrained deep audio encoder by default. However, this vital model component has received little dedicated exploration or analysis. In this work, we categorize and explain relevant pretraining tasks for deep audio encoders, and compare downstream AAC performance across four publicly accessible pretrained audio encoders in our experiments. We find that merely swapping the pretrained audio encoder in an AAC model can change the caption quality metric, SPIDEr-FL, by as much as 15%, and inference speed by more than 100%. Finally, considering the tradeoff between speed and quality, we recommend favoring architecturally simpler audio encoders with more pretraining for time-sensitive applications such as self-driving cars.