Researchers at Carnegie Mellon University and Olin College of Engineering have explored using contact microphones to train ML models for robot manipulation with audio data.
Robots designed for real-world tasks must be able to grasp and manipulate objects effectively across varied settings. Recent machine learning models aim to improve these capabilities: the most successful ones typically rely on extensive pretraining on large, predominantly visual datasets, and some also integrate tactile information to boost performance.
Researchers at Carnegie Mellon University and Olin College of Engineering have investigated contact microphones as an alternative to traditional tactile sensors. This approach allows the training of machine learning models for robot manipulation using audio data.
In contrast to the abundance of visual data, it remains unclear what relevant internet-scale data could be used for pretraining other modalities such as tactile sensing, even though these modalities are increasingly important in the low-data regimes typical of robotics applications. The researchers address this gap by using contact microphones as an alternative tactile sensor.
In their recent research, the team used a self-supervised machine learning model pretrained on the AudioSet dataset, which contains over two million 10-second video clips of diverse sounds and music collected from the web. The model employs audio-visual instance discrimination (AVID), a self-supervised method that learns representations by matching audio clips with their corresponding video frames and distinguishing them from mismatched pairs.
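To make the idea of audio-visual instance discrimination concrete, here is a minimal sketch of a cross-modal contrastive objective in that spirit: audio and video from the same clip are pulled together in embedding space, while other clips in the batch serve as negatives. The encoder architectures, embedding size, and temperature below are illustrative assumptions, not the configuration used in the paper.

```python
# Sketch of AVID-style cross-modal instance discrimination (illustrative, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioEncoder(nn.Module):
    """Toy encoder mapping log-mel spectrograms (B, 1, mels, frames) to unit-norm embeddings."""
    def __init__(self, embed_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

class VideoEncoder(nn.Module):
    """Toy encoder mapping video frames (B, 3, H, W) to unit-norm embeddings."""
    def __init__(self, embed_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

def cross_modal_infonce(audio_emb, video_emb, temperature=0.07):
    """Each audio clip should match its own video clip; other clips in the batch are negatives."""
    logits = audio_emb @ video_emb.t() / temperature        # (B, B) similarity matrix
    targets = torch.arange(audio_emb.size(0))               # positives lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +        # audio -> video direction
                  F.cross_entropy(logits.t(), targets))     # video -> audio direction

# One illustrative training step on random tensors standing in for AudioSet clips.
audio_enc, video_enc = AudioEncoder(), VideoEncoder()
spectrograms = torch.randn(8, 1, 64, 96)    # 8 log-mel spectrograms
frames = torch.randn(8, 3, 112, 112)        # the 8 matching video frames
loss = cross_modal_infonce(audio_enc(spectrograms), video_enc(frames))
loss.backward()
```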
The team evaluated their model on real-world manipulation tasks, with the robot learning each task from no more than 60 demonstrations. The results were encouraging: the model outperformed counterparts relying solely on visual data, especially in scenarios where the objects and settings differed significantly from those seen during training.
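The sketch below illustrates how such few-shot learning from demonstrations could look in practice: a small policy head is trained by behavior cloning on features from frozen pretrained visual and audio encoders. The feature dimensions, 7-dimensional action space, demonstration format, and training loop are assumptions made for illustration; the paper's actual policy architecture and training procedure may differ.

```python
# Hedged sketch: behavior cloning on top of frozen pretrained visual and audio features.
import torch
import torch.nn as nn

class ManipulationPolicy(nn.Module):
    def __init__(self, visual_dim=512, audio_dim=512, action_dim=7):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(visual_dim + audio_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, visual_feat, audio_feat):
        # Concatenate the two pretrained representations and regress an action.
        return self.head(torch.cat([visual_feat, audio_feat], dim=-1))

policy = ManipulationPolicy()
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)

# Stand-ins for features extracted from a handful of demonstrations of one task.
visual_feats = torch.randn(60, 512)    # per-frame visual embeddings
audio_feats = torch.randn(60, 512)     # per-frame contact-microphone embeddings
expert_actions = torch.randn(60, 7)    # demonstrated end-effector actions

for _ in range(100):                   # simple behavior-cloning loop
    pred = policy(visual_feats, audio_feats)
    loss = nn.functional.mse_loss(pred, expert_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```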
The key insight is that, because contact microphones capture inherently audio-based information, large-scale audio-visual pretraining can be used to obtain representations that improve robotic manipulation. According to the authors, this is the first method to leverage large-scale multisensory pretraining for robotic manipulation.
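A short sketch makes this insight tangible: a raw waveform from a gripper-mounted contact microphone can be converted into a log-mel spectrogram and fed to any encoder pretrained on ordinary audio, just like a recording from a conventional microphone. The sampling rate and spectrogram parameters below are assumptions, and the placeholder encoder stands in for a real pretrained model.

```python
# Sketch: treating a contact-microphone signal as ordinary audio for a pretrained encoder.
import torch
import torchaudio

SAMPLE_RATE = 16000  # assumed contact-microphone sampling rate

to_mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=SAMPLE_RATE, n_fft=400, hop_length=160, n_mels=64
)
to_db = torchaudio.transforms.AmplitudeToDB()

waveform = torch.randn(1, SAMPLE_RATE)     # 1 second of contact-mic signal (random stand-in)
log_mel = to_db(to_mel(waveform))          # (1, 64, frames) log-mel spectrogram

# log_mel can now be passed to an audio encoder pretrained on internet-scale
# audio-visual data, exactly as an ordinary microphone recording would be.
pretrained_audio_encoder = torch.nn.Identity()  # placeholder for a real pretrained encoder
features = pretrained_audio_encoder(log_mel.unsqueeze(0))
```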
Looking ahead, the team's research could pave the way for more capable robot manipulation built on pretrained multimodal machine learning models. Their approach could also be further refined and tested on a wider range of real-world manipulation tasks.
Reference: Jared Mejia et al., "Hearing Touch: Audio-Visual Pretraining for Contact-Rich Manipulation," arXiv (2024). DOI: 10.48550/arXiv.2405.08576