TR2025-167

Robot Confirmation Generation and Action Planning Using Long-context Q-Former Integrated with Multimodal LLM


    •  Hori, C., Masuyama, Y., Jain, S., Corcodel, R., Jha, D.K., Romeres, D., Le Roux, J., "Robot Confirmation Generation and Action Planning Using Long-context Q-Former Integrated with Multimodal LLM", IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), December 2025.
      @inproceedings{Hori2025dec,
        author = {Hori, Chiori and Masuyama, Yoshiki and Jain, Siddarth and Corcodel, Radu and Jha, Devesh K. and Romeres, Diego and {Le Roux}, Jonathan},
        title = {{Robot Confirmation Generation and Action Planning Using Long-context Q-Former Integrated with Multimodal LLM}},
        booktitle = {IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU)},
        year = 2025,
        month = dec,
        url = {https://www.merl.com/publications/TR2025-167}
      }
  • Research Areas: Artificial Intelligence, Computer Vision, Machine Learning, Robotics, Speech & Audio

Abstract:

Human-robot collaboration toward a shared goal requires robots to understand human actions and interactions with the surrounding environment. This paper focuses on human-robot interaction (HRI) based on human-robot dialogue, which relies on robot action confirmation and action step generation using multimodal scene understanding. The state-of-the-art approach uses multimodal transformers to generate robot action steps aligned with robot action confirmations from a single clip showing a task composed of multiple micro steps. Although the actions in a long-horizon task depend on each other throughout an entire video, current approaches mainly operate at the clip level and do not leverage long-context information. This paper proposes a long-context Q-Former that incorporates left and right context dependencies across full videos. Furthermore, it proposes a text-conditioning approach that feeds text embeddings directly into the LLM decoder to mitigate the over-abstraction of textual information by the Q-Former. Experiments on the YouCook2 corpus show that the accuracy of confirmation generation is a major factor in action planning performance. Furthermore, we demonstrate that the long-context Q-Former improves confirmation generation and action planning when integrated with VideoLLaMA3.
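The long-context idea in the abstract can be illustrated schematically: learned query tokens attend not only to the features of the current clip but also to those of its left and right neighbours before the pooled queries are handed to the LLM. The following toy NumPy sketch shows that cross-clip attention pattern only; the function and parameter names (`long_context_queries`, `n_queries`) are hypothetical, and this is not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def long_context_queries(clip_feats, n_queries=4, d=8, seed=0):
    """Toy long-context attention step (hypothetical sketch).

    For each clip, a fixed set of learned query tokens attends over the
    concatenated features of the previous, current, and next clip, so the
    pooled representation carries left and right context.
    clip_feats: list of arrays, each (T_i, d).
    Returns a list of (n_queries, d) arrays, one per clip.
    """
    rng = np.random.default_rng(seed)
    queries = rng.standard_normal((n_queries, d))  # stand-in for learned queries
    outputs = []
    for i in range(len(clip_feats)):
        # Concatenate left neighbour, current clip, and right neighbour.
        lo, hi = max(0, i - 1), min(len(clip_feats), i + 2)
        ctx = np.concatenate(clip_feats[lo:hi], axis=0)  # (T_ctx, d)
        attn = softmax(queries @ ctx.T / np.sqrt(d))     # (n_queries, T_ctx)
        outputs.append(attn @ ctx)                        # (n_queries, d)
    return outputs
```

In the paper's actual model the queries, attention weights, and the mapping into the LLM decoder are all learned end to end; the sketch above only conveys how neighbouring clips enter each clip's pooled representation.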


  • Related Publication

  •  Hori, C., Masuyama, Y., Jain, S., Corcodel, R., Jha, D.K., Romeres, D., Le Roux, J., "Robot Confirmation Generation and Action Planning Using Long-context Q-Former Integrated with Multimodal LLM", arXiv, November 2025.
    @article{Hori2025nov,
      author = {Hori, Chiori and Masuyama, Yoshiki and Jain, Siddarth and Corcodel, Radu and Jha, Devesh K. and Romeres, Diego and {Le Roux}, Jonathan},
      title = {{Robot Confirmation Generation and Action Planning Using Long-context Q-Former Integrated with Multimodal LLM}},
      journal = {arXiv},
      year = 2025,
      month = nov,
      url = {https://arxiv.org/abs/2511.17335}
    }