TR2026-026

Training Task Reasoning LLM Agents for Multi-turn Task Planning via Single-turn Reinforcement Learning


    •  Hu, H., Liu, C., Li, N., Wang, Y., "Training Task Reasoning LLM Agents for Multi-turn Task Planning via Single-turn Reinforcement Learning", IEEE Control Systems Letters, DOI: 10.1109/LCSYS.2025.3642767, Vol. 9, pp. 2879-2884, February 2026.
      BibTeX:
      @article{Hu2026feb,
        author = {Hu, Hanjiang and Liu, Changliu and Li, Na and Wang, Yebin},
        title = {{Training Task Reasoning LLM Agents for Multi-turn Task Planning via Single-turn Reinforcement Learning}},
        journal = {IEEE Control Systems Letters},
        year = 2026,
        volume = 9,
        pages = {2879--2884},
        month = feb,
        doi = {10.1109/LCSYS.2025.3642767},
        url = {https://www.merl.com/publications/TR2026-026}
      }
  • Research Areas: Artificial Intelligence, Machine Learning, Optimization

Abstract:

Large Language Models (LLMs) have demonstrated remarkable capabilities in knowledge acquisition, reasoning, and tool use, making them promising candidates for autonomous agent applications. However, training LLM agents for complex multi-turn task planning faces significant challenges, including sparse episode-wise rewards, credit assignment across long horizons, and the computational overhead of reinforcement learning in multi-turn interaction settings. To this end, this paper introduces a novel approach that transforms multi-turn task planning into single-turn task reasoning problems, enabling efficient policy optimization through Group Relative Policy Optimization (GRPO) with dense and verifiable reward from expert trajectories. Our theoretical analysis shows that GRPO improvement on single-turn task reasoning results in a lower bound of the multi-turn success probability under the minimal turns, as well as the generalization to subtasks with shorter horizons. Experimental evaluation on the complex task planning benchmark demonstrates that our 1.5B parameter model trained with single-turn GRPO achieves superior performance compared to larger baseline models up to 14B parameters, with success rates of 70% for long-horizon planning tasks.
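
As a rough illustration of the single-turn setup described in the abstract, the sketch below scores a group of sampled next-step actions against one expert action and converts the rewards into group-relative advantages, the normalization that GRPO is generally described as using. This is not the authors' implementation: the function names, the exact-match reward, the example actions, and the use of the population standard deviation are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def verifiable_reward(proposed_action: str, expert_action: str) -> float:
    """Hypothetical dense reward: 1.0 if the sampled action matches the
    expert's next action (ignoring case and outer whitespace), else 0.0."""
    same = proposed_action.strip().lower() == expert_action.strip().lower()
    return 1.0 if same else 0.0


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and standard deviation of its group.

    Assumption: population standard deviation; eps avoids division by zero
    when every sample in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One single-turn query: the state so far plus the expert's next action.
# The action strings are made up for this example.
expert_action = "pick up knife"
group = ["pick up knife", "slice tomato", "Pick up knife ", "open fridge"]

rewards = [verifiable_reward(a, expert_action) for a in group]
advantages = group_relative_advantages(rewards)

for action, r, adv in zip(group, rewards, advantages):
    print(f"{action!r:20} reward={r:.1f} advantage={adv:+.3f}")
```

Because every sampled step is checked directly against the expert trajectory, each sample gets its own reward signal rather than waiting for an episode-level outcome, which is the sense in which the abstract calls the reward dense and verifiable.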

 

  • Related News & Events

    •  NEWS    MERL researchers present 8 papers at ACC 2026
      Date: May 26, 2026 - May 29, 2026
      Where: New Orleans, USA
      MERL Contacts: Scott A. Bortoff; Vedang M. Deshpande; Stefano Di Cairano; Christopher R. Laughman; Jordan Leung; Hongtao Qiao; Zhaolin Ren; Abraham P. Vinod; Yebin Wang
      Research Areas: Control, Dynamical Systems, Optimization, Robotics
      Brief
      • MERL researchers presented 8 papers at the recently concluded American Control Conference (ACC) 2026 in New Orleans, USA. The papers covered a wide range of topics including robust controllable set computation, vapor compression cycle calibration, task-reasoning LLM agents, Minkowski-cost stable MPC, polynomial chaos approximation, invariant-set motion planning, heat-pump MPC architectures, and relaxed barrier-function MPC. Additionally, Zhaolin Ren was an invited speaker at the Multi-Agent Dynamic Games workshop, and Abraham Vinod served as a panelist at the Professional Development and Career Advice for Young Professionals session.

        As a sponsor of the conference, MERL maintained a booth for open discussions with researchers and students, and hosted a special session to discuss highlights of MERL research and work philosophy.
  • Related Publication

  •  Hu, H., Liu, C., Li, N., Wang, Y., "Training Task Reasoning LLM Agents for Multi-turn Task Planning via Single-turn Reinforcement Learning", arXiv, September 2025.
    BibTeX:
    @article{Hu2025sep,
      author = {Hu, Hanjiang and Liu, Changliu and Li, Na and Wang, Yebin},
      title = {{Training Task Reasoning LLM Agents for Multi-turn Task Planning via Single-turn Reinforcement Learning}},
      journal = {arXiv},
      year = 2025,
      month = sep,
      url = {https://arxiv.org/abs/2509.20616}
    }