Attention-Based Multimodal Fusion for Video Description

A novel neural network architecture that fuses multimodal information using a modality-dependent attention mechanism.

MERL Researchers: Chiori Hori, Takaaki Hori, John Hershey, Teng-Yok Lee, Tim K. Marks.


Understanding scenes from sensed information is a fundamental challenge for human-machine interfaces. We aim to develop methods for learning semantic representations from multimodal information, including both visual and audio data, as the basis for intelligent communication and interaction with machines. Towards this goal, we invented a modality-dependent attention mechanism for video captioning based on encoder-decoder sentence generation using recurrent neural networks (RNNs).

Figure 1. Simple feature fusion
Figure 2. Our multimodal attention mechanism

Our method provides an effective way to fuse multimodal information: the attention model attends not only to specific times (temporal attention) or spatial regions (spatial attention), as in conventional schemes, but also to specific input modalities such as image features, motion features, and audio features. We refer to this as modal attention. We evaluated our method on the Youtube2Text dataset, achieving results that are competitive with the current state of the art. The results show that our model, which combines multimodal attention with temporal attention, outperforms a model that uses temporal attention alone.
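The idea behind modal attention can be sketched in a few lines of code. In the hypothetical sketch below (not the exact MERL model; all variable names, dimensions, and the scoring function are illustrative assumptions), each modality's feature vector is projected into a common space, scored against the current decoder state, and fused with softmax weights, so the decoder can emphasize, say, audio over motion at a given word.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax."""
    e = np.exp(x - x.max())
    return e / e.sum()

def modal_attention(h, feats, projs, U, v):
    """Fuse per-modality features with decoder-state-dependent weights.

    h     : (d,) current decoder hidden state
    feats : dict modality name -> (d_m,) feature vector
            (assumed already temporally attended)
    projs : dict modality name -> (d, d_m) learned projection (assumption)
    U     : (d, d) learned state transform (assumption)
    v     : (d,) learned scoring vector (assumption)
    """
    names = list(feats)
    # Map each modality into a common d-dimensional space.
    projected = {m: projs[m] @ feats[m] for m in names}
    # Score each modality against the decoder state.
    scores = np.array([v @ np.tanh(U @ h + projected[m]) for m in names])
    alpha = softmax(scores)  # modality attention weights, sum to 1
    # Weighted fusion of the projected modality features.
    fused = sum(a * projected[m] for a, m in zip(alpha, names))
    return fused, dict(zip(names, alpha))
```

In contrast, the simple fusion of Figure 1 would concatenate or average the modality features with fixed weights; here the weights change at every decoding step as `h` changes.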


Sample videos showing improvements due to our multimodal attention mechanism:
1. Audio features enable identification of the "peeling" action.
2. Audio features make the description worse because the audio track contains only background music.
3. The multimodal attention mechanism and audio features are complementary.