Audio-Visual Scene-Aware Dialog

    •  Alamri, H., Cartillier, V., Das, A., Wang, J., Lee, S., Anderson, P., Essa, I., Parikh, D., Batra, D., Cherian, A., Marks, T.K., Hori, C., "Audio-Visual Scene-Aware Dialog", IEEE Conference on Computer Vision and Pattern Recognition (CVPR), DOI: 10.1109/CVPR.2019.00774, June 2019, pp. 7550-7559.
      BibTeX TR2019-048 PDF
      • @inproceedings{Alamri2019jun,
      • author = {Alamri, Huda and Cartillier, Vincent and Das, Abhishek and Wang, Jue and Lee, Stefan and Anderson, Peter and Essa, Irfan and Parikh, Devi and Batra, Dhruv and Cherian, Anoop and Marks, Tim K. and Hori, Chiori},
      • title = {Audio-Visual Scene-Aware Dialog},
      • booktitle = {IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
      • year = 2019,
      • pages = {7550--7559},
      • month = jun,
      • doi = {10.1109/CVPR.2019.00774},
      • url = {}
      • }
  • MERL Contacts:
  • Research Areas:

    Artificial Intelligence, Computer Vision, Machine Learning, Speech & Audio

We introduce the task of scene-aware dialog. Given a follow-up question in an ongoing dialog about a video, our goal is to generate a complete and natural response to a question given (a) an input video, and (b) the history of previous turns in the dialog. To succeed, agents must ground the semantics in the video and leverage contextual cues from the history of the dialog to answer the question. To benchmark this task, we introduce the Audio Visual Scene-Aware Dialog (AVSD) dataset. For each of more than 11,000 videos of human actions for the Charades dataset. Our dataset contains a dialog about the video, plus a final summary of the video by one of the dialog participants. We train several baseline systems for this task and evaluate the performance of the trained models using several qualitative and quantitative metrics. Our results indicate that the models must comprehend all the available inputs (video, audio, question and dialog history) to perform well on this dataset.


  • Related News & Events

    •  NEWS   MERL's Scene-Aware Interaction Technology Featured in Mitsubishi Electric Corporation Press Release
      Date: July 22, 2020
      Where: Tokyo, Japan
      MERL Contacts: Siheng Chen; Anoop Cherian; Bret Harsham; Chiori Hori; Takaaki Hori; Jonathan Le Roux; Tim Marks; Alan Sullivan; Anthony Vetro
      Research Areas: Artificial Intelligence, Computer Vision, Machine Learning, Speech & Audio
      • Mitsubishi Electric Corporation announced that the company has developed what it believes to be the world’s first technology capable of highly natural and intuitive interaction with humans based on a scene-aware capability to translate multimodal sensing information into natural language.

        The novel technology, Scene-Aware Interaction, incorporates Mitsubishi Electric’s proprietary Maisart® compact AI technology to analyze multimodal sensing information for highly natural and intuitive interaction with humans through context-dependent generation of natural language. The technology recognizes contextual objects and events based on multimodal sensing information, such as images and video captured with cameras, audio information recorded with microphones, and localization information measured with LiDAR.

        Scene-Aware Interaction for car navigation, one target application, will provide drivers with intuitive route guidance. The technology is also expected to have applicability to human-machine interfaces for in-vehicle infotainment, interaction with service robots in building and factory automation systems, systems that monitor the health and well-being of people, surveillance systems that interpret complex scenes for humans and encourage social distancing, support for touchless operation of equipment in public areas, and much more. The technology is based on recent research by MERL's Speech & Audio and Computer Vision groups.

        Demonstration Video:


        Mitsubishi Electric Corporation Press Release