TR2024-043

RILA: Reflective and Imaginative Language Agent for Zero-Shot Semantic Audio-Visual Navigation


    •  Yang, Z., Liu, J., Chen, P., Cherian, A., Marks, T.K., Le Roux, J., Gan, C., "RILA: Reflective and Imaginative Language Agent for Zero-Shot Semantic Audio-Visual Navigation", IEEE Conference on Computer Vision and Pattern Recognition (CVPR), April 2024.
      BibTeX:
      @inproceedings{Yang2024apr,
        author = {Yang, Zeyuan and Liu, Jiageng and Chen, Peihao and Cherian, Anoop and Marks, Tim K. and Le Roux, Jonathan and Gan, Chuang},
        title = {RILA: Reflective and Imaginative Language Agent for Zero-Shot Semantic Audio-Visual Navigation},
        booktitle = {IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
        year = 2024,
        month = apr,
        url = {https://www.merl.com/publications/TR2024-043}
      }
  • Research Areas: Artificial Intelligence, Computer Vision, Machine Learning, Speech & Audio

Abstract:

We leverage Large Language Models (LLMs) for zero-shot Semantic Audio-Visual Navigation (SAVN). Existing methods rely on extensive training demonstrations for reinforcement learning, yet achieve relatively low success rates and lack generalizability. The intermittent nature of auditory signals poses an additional obstacle to inferring the goal information. To address these challenges, we present the Reflective and Imaginative Language Agent (RILA). By employing multi-modal models to process sensory data, we instruct an LLM-based planner to actively explore the environment. During exploration, our agent adaptively evaluates and dismisses inaccurate perceptual descriptions. Additionally, we introduce an auxiliary LLM-based assistant to enhance global environmental comprehension by mapping room layouts and providing strategic insights. Through comprehensive experiments and analysis, we show that our method outperforms relevant baselines without training demonstrations from the environment or complementary semantic information.
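
The control loop described in the abstract (perceive, dismiss unreliable descriptions, then ask a planner for the next action) can be illustrated with a toy sketch. This is not the authors' implementation. The abstract does not say how descriptions are evaluated, so a fixed confidence threshold is substituted here purely for illustration, and a rule-based `stub_planner` stands in for the LLM-based planner. All names (`Observation`, `Memory`, `build_prompt`, `stub_planner`, `run_episode`) and the action strings are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Observation:
    """One textual perceptual description (hypothetical structure)."""
    step: int
    modality: str      # "vision" or "audio"
    text: str
    confidence: float  # assumed score in [0, 1]; not taken from the paper


@dataclass
class Memory:
    """Descriptions the agent currently trusts, plus those it dismissed."""
    kept: List[Observation] = field(default_factory=list)
    dismissed: List[Observation] = field(default_factory=list)


def build_prompt(memory: Memory) -> str:
    """Serialize trusted descriptions into a text prompt for the planner."""
    lines = [f"[{o.modality}] {o.text}" for o in memory.kept]
    return "Trusted observations:\n" + "\n".join(lines) + "\nNext action?"


def stub_planner(prompt: str) -> str:
    """Rule-based stand-in for an LLM call (assumption, for illustration)."""
    # Audio is intermittent, so any remembered audio cue keeps guiding the agent.
    if "[audio]" in prompt:
        return "move_toward_sound"
    return "explore"


def run_episode(
    observations: List[Observation],
    planner: Callable[[str], str] = stub_planner,
    threshold: float = 0.5,
) -> Tuple[List[str], Memory]:
    """Filter each description, then query the planner once per step."""
    memory = Memory()
    actions: List[str] = []
    for obs in observations:
        # "Reflection" is reduced here to a threshold test (an assumption).
        if obs.confidence >= threshold:
            memory.kept.append(obs)
        else:
            memory.dismissed.append(obs)
        actions.append(planner(build_prompt(memory)))
    return actions, memory


if __name__ == "__main__":
    episode = [
        Observation(0, "vision", "a hallway with two doors", 0.9),
        Observation(1, "vision", "a bathtub in the hallway", 0.2),
        Observation(2, "audio", "running water to the left", 0.8),
        Observation(3, "vision", "a kitchen sink", 0.7),
    ]
    actions, memory = run_episode(episode)
    print(actions)
    print([o.text for o in memory.dismissed])
```

In this sketch the low-confidence "bathtub in the hallway" description is dismissed and never reaches the planner, while the single audio cue stays in memory and continues to steer the agent after the sound stops. Swapping `stub_planner` for a real LLM call would only require a function with the same string-in, string-out signature.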