Early and Late Integration of Audio Features for Automatic Video Description

This paper presents our approach to improve video captioning by integrating audio and video features. Video captioning is the task of generating a textual description to describe the content of a video. State-of-the-art approaches to video captioning are based on sequence-to-sequence models, in which a single neural network accepts sequential images and audio data, and outputs a sequence of words that best describe the input data in natural language. The network thus learns to encode the video input into an intermediate semantic representation, which can be useful in applications such as multimedia indexing, automatic narration, and audio-visual question answering. In our prior work, we proposed an attentionbased multi-modal fusion mechanism to integrate image, motion, and audio features, where the multiple features are integrated in the network. Here, we apply hypothesis-level integration based on minimum Bayes-risk (MBR) decoding to further improve the caption quality, focusing on well-known evaluation metrics (BLEU and METEOR scores). Experiments with the YouTube2Text and MSR-VTT datasets demonstrate that combinations of early and late integration of multimodal features significantly improve the audio-visual semantic representation, as measured by the resulting caption quality. In addition, we compared the performance of our method using two different types of audio features: MFCC features, and the audio features extracted using SoundNet, which was trained to recognize objects and scenes from videos using only the audio signals.