MIT and MIT-IBM Watson AI Lab researchers have created a navigation method that converts visual inputs into text descriptions a language model can use to guide robots through multistep tasks.
Someday, you might want a home robot to carry laundry down to the basement, a task that requires it to combine verbal instructions with visual cues. This is challenging for AI agents, however, because current systems need multiple complex machine-learning models and large amounts of visual training data, which are hard to obtain.
Researchers from MIT and the MIT-IBM Watson AI Lab have developed a navigation method that translates visual inputs into text descriptions. A large language model then processes these descriptions to guide a robot through multistep tasks. Because the approach relies on text captions rather than computationally intensive visual representations, it also lets the researchers generate large amounts of synthetic training data efficiently.
Solving a vision problem with language
The researchers' method uses a simple captioning model to translate the robot's visual observations into text descriptions. These descriptions, together with the verbal instructions, are fed into a large language model, which decides the robot's next step. After the robot takes that step, the captioning model describes the new scene, and the language model uses the updated description to plan the next move, continually guiding the robot toward its goal. To streamline decision-making, the information is standardized in templates that present the surroundings as a set of choices, such as moving toward a door or toward an office.
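As a rough illustration of this loop, here is a minimal sketch in Python. It assumes hypothetical helpers: caption_image (a captioning model), query_llm (a call to a large language model), and simple callbacks for reading observations, listing nearby landmarks, and executing moves. None of these names come from the researchers' system, and the template wording is invented for illustration.

    from typing import Callable, List

    def build_prompt(instruction: str, caption: str, choices: List[str]) -> str:
        """Standardize the scene description and candidate moves in a fixed template."""
        options = "\n".join(f"{i + 1}. move toward the {c}" for i, c in enumerate(choices))
        return (
            f"Instruction: {instruction}\n"
            f"Current scene: {caption}\n"
            f"Possible next steps:\n{options}\n"
            "Answer with the number of the best next step."
        )

    def navigate(
        instruction: str,
        get_observation: Callable[[], object],   # returns the robot's current camera view
        caption_image: Callable[[object], str],  # hypothetical captioning model
        query_llm: Callable[[str], str],         # hypothetical large language model call
        get_choices: Callable[[], List[str]],    # nearby landmarks, e.g. ["door", "office"]
        execute: Callable[[str], None],          # makes the robot carry out the chosen step
        max_steps: int = 20,
    ) -> List[str]:
        """Caption the scene, ask the LLM to pick the next step, execute it, and repeat."""
        decisions = []
        for _ in range(max_steps):
            caption = caption_image(get_observation())   # visual input -> text
            prompt = build_prompt(instruction, caption, get_choices())
            choice = query_llm(prompt)                   # LLM selects an option
            decisions.append(choice)
            execute(choice)                              # robot takes the step
        return decisions

The point of the sketch is that every observation is reduced to text before it reaches the language model, so the same prompt template can serve regardless of the robot's camera or environment.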
Advantages of language
In tests, this language-based navigation approach did not outperform vision-based methods, but it offers distinct advantages. It uses far fewer computational resources, which makes synthetic data generation fast: for instance, the researchers created 10,000 synthetic trajectories from only 10 real-world ones. Because the system works in natural language, it is easier for humans to understand and can handle many different tasks with a single type of input. However, it loses some information that vision-based models capture, such as depth. Surprisingly, combining the language-based approach with vision-based methods improves navigation performance.
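The synthetic-data advantage follows directly from the text representation. The sketch below shows one plausible way to expand a few real captioned trajectories into many variants by asking a language model to paraphrase each step; it reuses the hypothetical query_llm helper from above and is not a description of the researchers' actual augmentation procedure.

    import random
    from typing import Callable, List

    def augment_trajectories(
        real_trajectories: List[List[str]],      # each trajectory is a list of step captions
        query_llm: Callable[[str], str],         # hypothetical large language model call
        target_count: int = 10_000,
        seed: int = 0,
    ) -> List[List[str]]:
        """Expand a few real text trajectories into many synthetic ones by paraphrasing steps."""
        rng = random.Random(seed)
        synthetic: List[List[str]] = []
        while len(synthetic) < target_count:
            source = rng.choice(real_trajectories)   # pick one of the real trajectories
            rewritten = [
                query_llm(f"Paraphrase this navigation step, keeping its meaning: {step}")
                for step in source
            ]
            synthetic.append(rewritten)
        return synthetic

Because the augmentation happens entirely in text, generating thousands of trajectories costs only language-model calls, with no rendering or simulation of new visual scenes.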
Researchers aim to enhance their method by developing a navigation-focused captioner and exploring how large language models can demonstrate spatial awareness to improve navigation.