TR2022-016

Overview of Audio Visual Scene-Aware Dialog with Reasoning Track for Natural Language Generation in DSTC10


    •  Hori, C., Shah, A.P., Geng, S., Gao, P., Cherian, A., Hori, T., Le Roux, J., Marks, T.K., "Overview of Audio Visual Scene-Aware Dialog with Reasoning Track for Natural Language Generation in DSTC10", The 10th Dialog System Technology Challenge Workshop at AAAI, February 2022.
      BibTeX:
      @inproceedings{Hori2022feb,
        author    = {Hori, Chiori and Shah, Ankit Parag and Geng, Shijie and Gao, Peng and Cherian, Anoop and Hori, Takaaki and Le Roux, Jonathan and Marks, Tim K.},
        title     = {Overview of Audio Visual Scene-Aware Dialog with Reasoning Track for Natural Language Generation in DSTC10},
        booktitle = {The 10th Dialog System Technology Challenge Workshop at AAAI},
        year      = {2022},
        month     = feb,
        url       = {https://www.merl.com/publications/TR2022-016}
      }
Research Areas: Artificial Intelligence, Computer Vision, Human-Computer Interaction, Speech & Audio

Abstract:

The Audio-Visual Scene-Aware Dialog (AVSD) task was proposed in the Dialog System Technology Challenge (DSTC), where an AVSD dataset was collected and AVSD technologies were developed. An AVSD challenge track was hosted at both the 7th and 8th DSTCs (DSTC7, DSTC8). In these challenges, the best-performing systems relied heavily on human-generated descriptions of the video content, which were available in the datasets but would be unavailable in real-world applications. To promote further advancements for real-world applications, a third AVSD challenge is proposed at DSTC10 with two modifications: 1) the human-created description is unavailable at inference time, and 2) systems must demonstrate temporal reasoning by finding evidence from the video to support each answer. This paper introduces the new task, which includes temporal reasoning, and the new extension of the AVSD dataset for DSTC10, for which human-generated temporal reasoning data were collected. A baseline system was built using an AV-transformer, and the new datasets were released for the challenge. Finally, this paper reports the challenge results of the 12 systems submitted to the AVSD task in DSTC10. The two systems using GPT-2-based multimodal transformers achieved the best performance in terms of human rating, BLEU-4, and CIDEr. The temporal reasoning performed by those systems outperformed the baseline method based on temporal attention.