Joint Student-Teacher Learning for Audio-Visual Scene-Aware Dialog

Multimodal fusion of audio, vision, and text has demonstrated significant benefits in advancing the performance of several tasks, including machine translation, video captioning, and video summarization. Audio-Visual Scene-aware Dialog (AVSD) is a new and more challenging task, proposed recently, that focuses on generating sentence responses to questions that are asked in a dialog about video content. While prior approaches designed to tackle this task have shown the need for multimodal fusion to improve response quality, the best-performing systems often rely heavily on human-generated summaries of the video content, which are unavailable when such systems are deployed in real-world. This paper investigates how to compensate for such information, which is missing in the inference phase but available during the training phase. To this end, we propose a novel AVSD system using studentteacher learning, in which a student network is (jointly) trained to mimic the teacher’s responses. Our experiments demonstrate that in addition to yielding state-of-the-art accuracy against the baseline DSTC7-AVSD system, the proposed approach (which does not use human-generated summaries at test time) performs competitively with methods that do use those summaries