TR2023-034

Aligning Step-by-Step Instructional Diagrams to Video Demonstrations


    •  Zhang, J., Cherian, A., Liu, Y., Ben-Shabat, I., Rodriguez, C., Gould, S., "Aligning Step-by-Step Instructional Diagrams to Video Demonstrations", IEEE Conference on Computer Vision and Pattern Recognition (CVPR), May 2023.
      BibTeX:

      @inproceedings{Zhang2023may,
        author = {Zhang, Jiahao and Cherian, Anoop and Liu, Yanbin and Ben-Shabat, Itzik and Rodriguez, Cristian and Gould, Stephen},
        title = {Aligning Step-by-Step Instructional Diagrams to Video Demonstrations},
        booktitle = {IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
        year = 2023,
        month = may,
        url = {https://www.merl.com/publications/TR2023-034}
      }
  • MERL Contact: Anoop Cherian
  • Research Areas: Computer Vision, Machine Learning, Speech & Audio

Abstract:

Project page: https://davidzhang73.github.io/en/publication/zhang-cvpr-2023/

Multimodal alignment facilitates the retrieval of instances from one modality when queried using another. In this paper, we consider a novel setting where such an alignment is between (i) instruction steps that are depicted as assembly diagrams (commonly seen in Ikea assembly manuals) and (ii) segments from in-the-wild videos; these videos comprise an enactment of the assembly actions in the real world. We introduce a supervised contrastive learning approach that learns to align videos with the subtle details of assembly diagrams, guided by a set of novel losses. To study this problem and evaluate the effectiveness of our method, we introduce a new dataset: IAW (Ikea Assembly in the Wild), consisting of 183 hours of videos from diverse furniture assembly collections and nearly 8,300 illustrations from their associated instruction manuals, annotated for their ground-truth alignments. We define two tasks on this dataset: first, nearest-neighbor retrieval between video segments and illustrations, and second, alignment of instruction steps and the segments for each video. Extensive experiments on IAW demonstrate superior performance of our approach against alternatives.
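
The abstract does not spell out the alignment objective, but the general recipe it names, supervised contrastive learning between video-segment and diagram embeddings followed by nearest-neighbor retrieval, can be sketched. Below is a minimal PyTorch illustration assuming a symmetric InfoNCE-style loss over L2-normalized embeddings; all names (contrastive_alignment_loss, video_emb, diagram_emb) are hypothetical, and the paper's actual architecture and novel losses are not reproduced here.

    import torch
    import torch.nn.functional as F

    def contrastive_alignment_loss(video_emb, diagram_emb, temperature=0.07):
        """Symmetric InfoNCE-style loss for N matched (video, diagram) pairs.

        video_emb, diagram_emb: (N, D) tensors; row i of each is a matched pair.
        """
        # L2-normalize so dot products are cosine similarities.
        v = F.normalize(video_emb, dim=-1)
        d = F.normalize(diagram_emb, dim=-1)
        logits = v @ d.t() / temperature  # (N, N); matched pairs on the diagonal
        targets = torch.arange(v.size(0), device=v.device)
        # Cross-entropy in both directions: video -> diagram and diagram -> video.
        loss_v2d = F.cross_entropy(logits, targets)
        loss_d2v = F.cross_entropy(logits.t(), targets)
        return 0.5 * (loss_v2d + loss_d2v)

    def nearest_neighbor_retrieval(query_emb, gallery_emb):
        """Index of the most similar gallery item for each query (task 1)."""
        q = F.normalize(query_emb, dim=-1)
        g = F.normalize(gallery_emb, dim=-1)
        return (q @ g.t()).argmax(dim=-1)

    # Toy usage with random features standing in for encoder outputs.
    video_emb = torch.randn(8, 512)    # e.g., pooled features of 8 video segments
    diagram_emb = torch.randn(8, 512)  # e.g., encoded manual illustrations
    loss = contrastive_alignment_loss(video_emb, diagram_emb)
    matches = nearest_neighbor_retrieval(video_emb, diagram_emb)

Under such an embedding, the second task, assigning each video segment to an instruction step, reduces at its simplest to the same argmax over cosine similarities, although a full solution would likely also exploit the temporal ordering of the instruction steps.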