End-to-End Audio Visual Scene-Aware Dialog Using Multimodal Attention-Based Video Features

In order for machines interacting with the real world to have conversations with users about the objects and events around them, they need to understand dynamic audiovisual scenes. Recent advances in neural network models allow us to combine various modules into a single end-to-end differentiable network. As a result, Audio Visual Scene-Aware Dialog (AVSD) systems for real-world applications can be developed by integrating state-of-the-art technologies from multiple research areas, including end-to-end dialog, visual question answering (VQA), and video description. In this paper, we introduce a new dataset of dialogs about videos of human behaviors, as well as an end-to-end AVSD model, trained on this dataset, that generates responses in a dialog about a video. By using features developed for multimodal attention-based video description, our system improves the quality of generated dialog about dynamic video scenes.
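To illustrate the idea behind multimodal attention-based video features, the following sketch shows a two-level attention scheme: temporal attention summarizes each modality's feature sequence into a context vector, and a second attention over modalities fuses the audio and visual contexts. This is a minimal illustration only; the weight matrices are random stand-ins for learned parameters, and the feature dimensions and scoring functions are hypothetical rather than those of the paper's actual model.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attend(features, query, W):
    """Attention-weighted sum over time steps for one modality.

    features: (T, d) sequence of frame-level features
    query:    (q,)   decoder state used as the attention query
    W:        (d, q) scoring matrix (random stand-in for a learned weight)
    """
    scores = features @ W @ query      # (T,) relevance of each time step
    weights = softmax(scores)          # attention distribution over time
    return weights @ features          # (d,) per-modality context vector

rng = np.random.default_rng(0)
d, q, T = 8, 4, 10

# Hypothetical pre-extracted feature streams for one video
visual = rng.standard_normal((T, d))
audio = rng.standard_normal((T, d))
query = rng.standard_normal(q)

# Level 1: temporal attention within each modality
c_v = attend(visual, query, rng.standard_normal((d, q)))
c_a = attend(audio, query, rng.standard_normal((d, q)))

# Level 2: attention over modalities weights the two context vectors
modality_scores = np.array([c_v @ rng.standard_normal(d),
                            c_a @ rng.standard_normal(d)])
beta = softmax(modality_scores)
fused = beta[0] * c_v + beta[1] * c_a  # multimodal context for the decoder
print(fused.shape)
```

Because the modality weights are recomputed from the query at every decoding step, the model can emphasize audio for sound-related questions and vision for appearance-related ones.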


  • Related Publication

  •  Hori, C., Alamri, H., Wang, J., Wichern, G., Hori, T., Cherian, A., Marks, T.K., Cartillier, V., Lopes, R., Das, A., Essa, I., Batra, D., Parikh, D., "End-to-End Audio Visual Scene-Aware Dialog using Multimodal Attention-Based Video Features", arXiv, July 13, 2018.
    @article{Hori2018jul,
      author = {Hori, Chiori and Alamri, Huda and Wang, Jue and Wichern, Gordon and Hori, Takaaki and Cherian, Anoop and Marks, Tim K. and Cartillier, Vincent and Lopes, Raphael and Das, Abhishek and Essa, Irfan and Batra, Dhruv and Parikh, Devi},
      title = {End-to-End Audio Visual Scene-Aware Dialog using Multimodal Attention-Based Video Features},
      journal = {arXiv},
      year = 2018,
      month = jul,
      url = {}
    }