TR2018-085

End-to-End Audio Visual Scene-Aware Dialog using Multimodal Attention-Based Video Features


    •  Hori, C., Alamri, H., Wang, J., Wichern, G., Hori, T., Cherian, A., Marks, T.K., Cartillier, V., Lopes, R., Das, A., Essa, I., Batra, D., Parikh, D., "End-to-End Audio Visual Scene-Aware Dialog using Multimodal Attention-Based Video Features", arXiv, July 13, 2018.
      BibTeX Download PDF
      • @techreport{MERL_TR2018-085,
      • author = {Hori, C. and Alamri, H. and Wang, J. and Wichern, G. and Hori, T. and Cherian, A. and Marks, T.K. and Cartillier, V. and Lopes, R. and Das, A. and Essa, I. and Batra, D. and Parikh, D.},
      • title = {End-to-End Audio Visual Scene-Aware Dialog using Multimodal Attention-Based Video Features},
      • institution = {MERL - Mitsubishi Electric Research Laboratories},
      • address = {Cambridge, MA 02139},
      • number = {TR2018-085},
      • month = jul,
      • year = 2018,
      • url = {http://www.merl.com/publications/TR2018-085/}
      • }
  • MERL Contacts:
  • Research Areas:

    Artificial Intelligence, Computer Vision, Machine Learning, Speech & Audio


Dialog systems need to understand dynamic visual scenes in order to have conversations with users about the objects and events around them. Scene-aware dialog systems for real-world applications could be developed by integrating state-ofthe-art technologies from multiple research areas, including: end-to-end dialog technologies, which generate system responses using models trained from dialog data; visual question answering (VQA) technologies, which answer questions about images using learned image features; and video description technologies, in which descriptions/captions are generated from videos using multimodal information. We introduce a new dataset of dialogs about videos of human behaviors. Each dialog is a typed conversation that consists of a sequence of 10 question-and-answer (QA) pairs between two Amazon Mechanical Turk (AMT) workers. In total, we collected dialogs on ~9, 000 videos. Using this new dataset, we trained an end-toend conversation model that generates responses in a dialog about a video. Our experiments demonstrate that using multimodal features that were developed for multimodal attention-based video description enhances the quality of generated dialog about dynamic scenes (videos). Our dataset, model code and pretrained models will be publicly available for a new Video Scene-Aware Dialog challenge.