TR2025-068

KitchenVLA: Iterative Vision-Language Corrections for Robotic Execution of Human Tasks


    •  Lu, K., Ma, C., Hori, C., Romeres, D., "KitchenVLA: Iterative Vision-Language Corrections for Robotic Execution of Human Tasks", IEEE International Conference on Robotics and Automation Workshop on Safely Leveraging Vision-Language Foundation Models in Robotics (SafeLVMs@ICRA), May 2025.
      @inproceedings{Lu2025may,
        author = {Lu, Kai and Ma, Chenyang and Hori, Chiori and Romeres, Diego},
        title = {{KitchenVLA: Iterative Vision-Language Corrections for Robotic Execution of Human Tasks}},
        booktitle = {IEEE International Conference on Robotics and Automation Workshop on Safely Leveraging Vision-Language Foundation Models in Robotics (SafeLVMs@ICRA)},
        year = 2025,
        month = may,
        url = {https://www.merl.com/publications/TR2025-068}
      }
  • MERL Contacts:
  • Research Areas:

    Artificial Intelligence, Computer Vision, Machine Learning, Robotics, Speech & Audio

Abstract:

In this paper, we present KitchenVLA, a Vision-Language-Action (VLA) framework for generating and optimizing executable robot actions from human instructional videos. While recent advances in video understanding and step generation have shown promising results, translating these steps into robot-executable actions remains challenging, particularly for complex, long-horizon tasks such as those in kitchen environments. These challenges arise from domain discrepancies between human videos and robotic settings, as well as mismatches between human actions and robot capabilities. To address these issues, we propose a zero-shot action planning and correction framework, where a Vision-Language Model (VLM) acts as an evaluator to analyze both the original human video and the robot’s observations to detect domain mismatches. The system assesses differences in object states and action feasibility, and generates corrective actions to align the robot’s execution with the intended task. By incorporating keyframe selection, language-guided segmentation, and simulation-based verification, KitchenVLA iteratively refines robotic plans to ensure contextual accuracy and executability. Through domain-aware evaluation and correction, our framework enhances the adaptability and robustness of robotic task execution in kitchen environments, advancing the integration of VLMs into robot learning and executable plan correction.
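The pipeline described in the abstract can be read as a plan-evaluate-correct cycle. The sketch below is illustrative only and is not the authors' implementation; every name in it (VLM, Step, simulate, select_keyframes, etc.) is a hypothetical placeholder standing in for the keyframe-selection, language-guided segmentation, simulation-based verification, and VLM-evaluation components mentioned above.

    # Illustrative sketch (not the authors' code) of an iterative
    # plan-evaluate-correct loop in the spirit of the abstract.
    # All names below are hypothetical placeholders.

    from dataclasses import dataclass, field
    from typing import Callable, List, Protocol


    @dataclass
    class Step:
        action: str   # e.g. "pick"
        target: str   # e.g. "mug"


    @dataclass
    class Plan:
        steps: List[Step] = field(default_factory=list)


    class VLM(Protocol):
        # Hypothetical interface for the VLM components: keyframe selection,
        # language-guided grounding of targets, mismatch evaluation, correction.
        def select_keyframes(self, video): ...
        def draft_plan(self, keyframes, scene) -> Plan: ...
        def segment_targets(self, scene, targets: List[str]): ...
        def evaluate(self, keyframes, observations): ...
        def correct(self, plan: Plan, report) -> Plan: ...


    def kitchenvla_loop(video, scene, vlm: VLM, simulate: Callable, max_iters: int = 5) -> Plan:
        """Refine a zero-shot plan until the VLM evaluator reports no mismatch
        between the human video and the (simulated) robot observations."""
        keyframes = vlm.select_keyframes(video)              # keyframe selection
        plan = vlm.draft_plan(keyframes, scene)               # zero-shot action planning

        for _ in range(max_iters):
            masks = vlm.segment_targets(scene, [s.target for s in plan.steps])
            observations = simulate(plan, scene, masks)       # simulation-based verification
            report = vlm.evaluate(keyframes, observations)    # detect object-state / feasibility mismatches
            if report.ok:
                break
            plan = vlm.correct(plan, report)                  # fold corrective actions into the plan

        return plan

In this reading, the VLM evaluator closes the loop: the plan is only executed on hardware once the simulated rollout no longer conflicts with the intent recovered from the human video, within the iteration budget.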

 

  • Related News & Events

    •  NEWS    MERL contributes to ICRA 2025
      Date: May 19, 2025 - May 23, 2025
      Where: IEEE ICRA
      MERL Contacts: Stefano Di Cairano; Jianlin Guo; Chiori Hori; Siddarth Jain; Devesh K. Jha; Toshiaki Koike-Akino; Philip V. Orlik; Arvind Raghunathan; Diego Romeres; Yuki Shirai; Abraham P. Vinod; Yebin Wang
      Research Areas: Artificial Intelligence, Computer Vision, Control, Dynamical Systems, Machine Learning, Optimization, Robotics, Human-Computer Interaction
      Brief
      • MERL made significant contributions to both the organization and the technical program of the International Conference on Robotics and Automation (ICRA) 2025, which was held in Atlanta, Georgia, USA, from May 19th to May 23rd.

        MERL was a Bronze sponsor of the conference, and MERL researchers chaired four sessions in the areas of Manipulation Planning, Human-Robot Collaboration, Diffusion Policy, and Learning for Robot Control.

        MERL researchers presented four papers in the main conference on the topics of contact-implicit trajectory optimization, proactive robotic assistance in human-robot collaboration, diffusion policy with human preferences, and dynamic and model learning of robotic manipulators. In addition, five more papers were presented in the workshops: “Structured Learning for Efficient, Reliable, and Transparent Robots,” “Safely Leveraging Vision-Language Foundation Models in Robotics: Challenges and Opportunities,” “Long-term Human Motion Prediction,” and “The Future of Intelligent Manufacturing: From Innovation to Implementation.”

        MERL researcher Diego Romeres delivered an invited talk titled “Dexterous Robotics: From Multimodal Sensing to Real-World Physical Interactions.”

        MERL also collaborated with the University of Padua on one of the conference’s challenges: the “3rd AI Olympics with RealAIGym” (https://ai-olympics.dfki-bremen.de).

        During the conference, MERL researchers received the IEEE Transactions on Automation Science and Engineering Best New Application Paper Award for their paper titled “Smart Actuation for End-Edge Industrial Control Systems.”

        About ICRA

        The IEEE International Conference on Robotics and Automation (ICRA) is the flagship conference of the IEEE Robotics and Automation Society and the world’s largest and most comprehensive technical conference focused on research advances and the latest technological developments in robotics. The event attracts over 7,000 participants, 143 partners and exhibitors, and receives more than 4,000 paper submissions.