TR2025-167

Robot Confirmation Generation and Action Planning Using Long-context Q-Former Integrated with Multimodal LLM


    •  Hori, C., Masuyama, Y., Jain, S., Corcodel, R., Jha, D.K., Romeres, D., Le Roux, J., "Robot Confirmation Generation and Action Planning Using Long-context Q-Former Integrated with Multimodal LLM", IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), December 2025.
      @inproceedings{Hori2025dec,
        author = {Hori, Chiori and Masuyama, Yoshiki and Jain, Siddarth and Corcodel, Radu and Jha, Devesh K. and Romeres, Diego and {Le Roux}, Jonathan},
        title = {{Robot Confirmation Generation and Action Planning Using Long-context Q-Former Integrated with Multimodal LLM}},
        booktitle = {IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU)},
        year = 2025,
        month = dec,
        url = {https://www.merl.com/publications/TR2025-167}
      }
  • Research Areas: Artificial Intelligence, Computer Vision, Machine Learning, Robotics, Speech & Audio

Abstract:

Human-robot collaboration toward a shared goal requires robots to understand human actions and interactions with the surrounding environment. This paper focuses on human-robot interaction (HRI) based on human-robot dialogue, which relies on robot action confirmation and action step generation using multimodal scene understanding. The state-of-the-art approach uses multimodal transformers to generate robot action steps aligned with robot action confirmations from a single clip showing a task composed of multiple micro steps. Although the actions in a long-horizon task depend on each other throughout an entire video, current approaches mainly operate at the clip level and do not leverage long-context information. This paper proposes a long-context Q-Former that incorporates left and right context dependencies across full videos. Furthermore, it proposes a text-conditioning approach that feeds text embeddings directly into the LLM decoder to mitigate the over-abstraction of textual information by the Q-Former. Experiments on the YouCook2 corpus show that the accuracy of confirmation generation is a major factor in action planning performance. Furthermore, we demonstrate that the long-context Q-Former improves confirmation generation and action planning when integrated with VideoLLaMA3.
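The long-context idea in the abstract can be illustrated schematically: learned query tokens attend not only to the features of the current clip but also to those of its left and right neighbours before the pooled queries are handed to the LLM. The following toy NumPy sketch shows that cross-clip attention pattern only; the function and parameter names (`long_context_queries`, `n_queries`) are hypothetical, and this is not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def long_context_queries(clip_feats, n_queries=4, d=8, seed=0):
    """Toy long-context attention step (hypothetical sketch).

    For each clip, a fixed set of learned query tokens attends over the
    concatenated features of the previous, current, and next clip, so the
    pooled representation carries left and right context.
    clip_feats: list of arrays, each (T_i, d).
    Returns a list of (n_queries, d) arrays, one per clip.
    """
    rng = np.random.default_rng(seed)
    queries = rng.standard_normal((n_queries, d))  # stand-in for learned queries
    outputs = []
    for i in range(len(clip_feats)):
        # Concatenate left neighbour, current clip, and right neighbour.
        lo, hi = max(0, i - 1), min(len(clip_feats), i + 2)
        ctx = np.concatenate(clip_feats[lo:hi], axis=0)  # (T_ctx, d)
        attn = softmax(queries @ ctx.T / np.sqrt(d))     # (n_queries, T_ctx)
        outputs.append(attn @ ctx)                        # (n_queries, d)
    return outputs
```

In the paper's actual model the queries, attention weights, and the mapping into the LLM decoder are all learned end to end; the sketch above only conveys how neighbouring clips enter each clip's pooled representation.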


  • Related Publication

  •  Hori, C., Masuyama, Y., Jain, S., Corcodel, R., Jha, D.K., Romeres, D., Le Roux, J., "Robot Confirmation Generation and Action Planning Using Long-context Q-Former Integrated with Multimodal LLM", arXiv, November 2025.
    @article{Hori2025nov,
      author = {Hori, Chiori and Masuyama, Yoshiki and Jain, Siddarth and Corcodel, Radu and Jha, Devesh K. and Romeres, Diego and {Le Roux}, Jonathan},
      title = {{Robot Confirmation Generation and Action Planning Using Long-context Q-Former Integrated with Multimodal LLM}},
      journal = {arXiv},
      year = 2025,
      month = nov,
      url = {https://arxiv.org/abs/2511.17335}
    }