MIT Develops AI for Long-Term Visual Task Planning, Doubling Effectiveness
MIT researchers have developed an artificial intelligence-driven method for planning long-term visual tasks, such as robot navigation. The approach more than doubles the plan success rate of some existing techniques.
The new method uses a specialized vision-language model (VLM) to interpret a scenario from an image and simulate the actions required to achieve a goal. A second model then translates these simulations into the Planning Domain Definition Language (PDDL), a standard format for describing planning problems, and refines the solution. The automatically generated PDDL files are fed into classical planning software, which computes a plan.
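In outline, the pipeline might look like the sketch below. Every name in it (describe_scene, generate_pddl, solve) is a hypothetical stub standing in for the two models and the planner; this is an illustration of the flow described in the article, not the researchers' code.

```python
# Illustrative sketch of the two-step flow: image -> scene description ->
# PDDL files -> classical planner. All functions are hypothetical stubs.

def describe_scene(image: bytes) -> str:
    """Stand-in for the first VLM: turn an image into a scene description."""
    return "block b sits on block a; block c is free on the table"

def generate_pddl(scene: str, goal: str) -> tuple[str, str]:
    """Stand-in for the second model: emit (domain, problem) PDDL text."""
    return "(define (domain blocks) ...)", "(define (problem restack) ...)"

def solve(domain: str, problem: str) -> list[str]:
    """Stand-in for an off-the-shelf classical PDDL planner."""
    return ["(move b a c)"]

def plan_from_image(image: bytes, goal: str) -> list[str]:
    scene = describe_scene(image)                 # step 1: interpret the image
    domain, problem = generate_pddl(scene, goal)  # step 2: translate to PDDL
    return solve(domain, problem)                 # planner computes the plan

print(plan_from_image(b"<camera frame>", "b on c"))
```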
This two-step system achieved an average plan success rate of about 70 percent, significantly outperforming baseline methods, which reached approximately 30 percent. A key feature of the system is its ability to solve novel problems, making it suitable for dynamic real-world environments.
Yilun Hao, an MIT graduate student and lead author, stated that the framework combines the image-understanding strengths of vision-language models with the planning capabilities of a formal solver. "This enables it to generate reliable, long-horizon plans from a single image," Hao added.
System Design and Functionality
The researchers aimed to address the limitations of large language models (LLMs) in handling visual inputs for planning problems such as robotic assembly or autonomous driving. While VLMs can process both images and text, they often struggle with spatial relationships and multi-step reasoning, both crucial for long-horizon planning.
Conversely, formal planners are robust at generating long-horizon plans but cannot process visual inputs directly and require expert knowledge to encode problems. The MIT team's system, named VLM-guided formal planning (VLMFP), integrates the strengths of both.
VLMFP employs two specialized VLMs:
- SimVLM: A smaller model trained to describe image scenarios using natural language and simulate action sequences within that scenario.
- GenVLM: A larger model that uses SimVLM's descriptions to generate initial PDDL files, which are then processed by a classical PDDL solver. GenVLM iteratively refines these files by comparing the solver's results with SimVLM's simulated outcomes.
Training on PDDL examples enables GenVLM to generate accurate PDDL files.
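Based on that description, GenVLM's generate-solve-check loop might look roughly like the sketch below. Only the loop structure comes from the article; every function is a stubbed, hypothetical stand-in for the two VLMs and the classical solver.

```python
# Sketch of the iterative refinement loop: draft PDDL, solve, replay the plan
# through SimVLM's simulation, and redraft when solver and simulator disagree.

def gen_vlm_generate(scene: str, goal: str, feedback: str) -> tuple[str, str]:
    """Stand-in for GenVLM: emit (domain, problem) PDDL, conditioned on feedback."""
    return "(define (domain d) ...)", "(define (problem p) ...)"

def classical_solver(domain: str, problem: str) -> list[str]:
    """Stand-in for a classical PDDL planner."""
    return ["(move a b)"]

def sim_vlm_check(scene: str, plan: list[str], goal: str) -> tuple[bool, str]:
    """Stand-in for SimVLM: simulate the plan, report success or a mismatch."""
    return True, ""

def refine(scene: str, goal: str, max_rounds: int = 5) -> list[str] | None:
    feedback = ""
    for _ in range(max_rounds):
        domain, problem = gen_vlm_generate(scene, goal, feedback)  # (re)draft PDDL
        plan = classical_solver(domain, problem)                   # formal planning
        ok, feedback = sim_vlm_check(scene, plan, goal)            # simulate, compare
        if ok:
            return plan  # solver's plan matches SimVLM's simulated outcome
    return None  # refinement budget exhausted without a valid plan
```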
The framework produces two PDDL files: a domain file defining the environment and actions, and a problem file defining initial states and goals. This separation allows the domain file to remain consistent across instances, promoting generalization.
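As a toy illustration of that split (our own blocks-world example, not drawn from the paper), the domain below is written once, while each new scenario needs only a fresh problem definition.

```python
# A minimal, hand-written PDDL pair illustrating the domain/problem split.
# The domain encodes the environment and its one action; it is reused as-is
# for every scenario, while the problem definition changes per image.

DOMAIN = """
(define (domain blocks)
  (:requirements :strips)
  (:predicates (on ?x ?y) (on-table ?x) (clear ?x))
  (:action move
    :parameters (?b ?from ?to)
    :precondition (and (clear ?b) (on ?b ?from) (clear ?to))
    :effect (and (on ?b ?to) (clear ?from)
                 (not (on ?b ?from)) (not (clear ?to)))))
"""

# Only this part changes for a new scenario; DOMAIN stays fixed.
PROBLEM = """
(define (problem restack)
  (:domain blocks)
  (:objects a b c)
  (:init (on-table a) (on b a) (clear b) (on-table c) (clear c))
  (:goal (on b c)))
"""
```

A classical planner such as Fast Downward could then consume these two definitions and return the one-step plan (move b a c).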
Performance and Future Work
The researchers carefully designed SimVLM's training data so that the model learned to understand problems and goals rather than memorize specific patterns. SimVLM successfully described scenarios, simulated actions, and detected goal achievement in approximately 85 percent of experiments.
Overall, the VLMFP framework achieved about a 60 percent success rate on six 2D planning tasks and over 80 percent on two 3D tasks, including multirobot collaboration and robotic assembly.
It also generated valid plans for more than 50 percent of previously unseen scenarios, significantly outperforming baseline methods.
The team plans to enhance VLMFP's ability to handle more complex scenarios and to explore methods for identifying and mitigating potential "hallucinations" by the VLMs. This research was partly funded by the MIT-IBM Watson AI Lab.