- The system could improve image quality in video streaming or help autonomous vehicles identify road hazards in real time.
- The model narrows the trade-off between computational cost and accuracy.
Semantic segmentation, which involves categorising each pixel in a high-resolution image to identify objects and potential obstructions, demands substantial computational resources, making real-time processing particularly challenging on devices with limited hardware capabilities. This situation presents a considerable impediment for autonomous vehicles, which rely heavily on swift and accurate object recognition to navigate safely.
To make real-time image analysis more practical, researchers from MIT and the MIT-IBM Watson AI Lab have introduced an efficient computer vision model for high-resolution semantic segmentation, the capability autonomous vehicles need to spot road hazards as they appear. The model is built around a new building block that reduces the computational complexity of high-resolution image analysis. In existing models, the amount of calculation grows quadratically with image resolution; in the new design it grows only linearly, without compromising accuracy, which makes the model significantly faster and more hardware-efficient.
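The difference between quadratic and linear growth can be made concrete with a rough operation count. The sketch below is illustrative only: the function name `attention_costs`, the patch size of 16 and the feature dimension of 32 are assumptions chosen for the example, not figures from the study, and the counts cover just the two matrix products in a single attention layer.

```python
def attention_costs(height, width, patch=16, dim=32):
    """Rough multiply counts for one attention layer (illustrative only).

    patch and dim are hypothetical values, not taken from the study.
    """
    # The image is split into patches; each patch becomes one token.
    n = (height // patch) * (width // patch)
    # Quadratic design: an N x N attention map is built, then applied to the values.
    quadratic = 2 * n * n * dim
    # Linear design: a dim x dim summary is built once, then applied to each token.
    linear = 2 * n * dim * dim
    return n, quadratic, linear


for side in (512, 1024, 2048):
    n, quad, lin = attention_costs(side, side)
    print(f"{side}x{side}: tokens={n}, quadratic={quad}, linear={lin}")
```

Under these assumptions, doubling the side length of the image quadruples the number of tokens. The linear count then grows by a factor of 4, while the quadratic count grows by a factor of 16.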
Song Han, an associate professor at MIT and a senior author of the study, emphasised that the work could make real-time image segmentation efficient enough to run locally on a device, a significant step for autonomous vehicle technology and beyond. The development points towards wider use of efficient, high-resolution computer vision in other domains, including medical image segmentation in healthcare.
The model series, dubbed EfficientViT, uses a linear similarity function in place of the nonlinear one commonly used to construct attention maps, which are central to vision transformers. This change preserves the global receptive field needed to understand image context while keeping computation manageable, and this balance between efficiency and performance is central to the new model’s architecture. When deployed on mobile devices, the model has run up to nine times faster than previous models while maintaining similar accuracy. The researchers claim that the work promises to usher in a new era of efficient AI computing, broadening the scope for real-time, high-resolution computer vision tasks.
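The effect of swapping the similarity function can be sketched in a few lines of NumPy. This is a minimal illustration, not the EfficientViT implementation: it assumes a ReLU-based similarity (the article does not name the function), and the function names, the `eps` stabiliser and the toy tensor sizes are hypothetical. With a linear similarity, the matrix products can be reordered so that the N x N attention map is never built, yet every token still draws on every other token.

```python
import numpy as np


def softmax_attention(q, k, v):
    """Conventional attention: builds an explicit N x N attention map."""
    scores = q @ k.T / np.sqrt(q.shape[-1])                 # (N, N)
    scores = scores - scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v                                      # (N, d)


def relu_attention_with_map(q, k, v, eps=1e-6):
    """ReLU similarity computed the direct way, still using an N x N map."""
    q, k = np.maximum(q, 0), np.maximum(k, 0)
    sim = q @ k.T                                           # (N, N)
    return (sim @ v) / (sim.sum(axis=-1, keepdims=True) + eps)


def relu_linear_attention(q, k, v, eps=1e-6):
    """Same result, reordered so that no N x N map is ever formed."""
    q, k = np.maximum(q, 0), np.maximum(k, 0)
    kv = k.T @ v                                            # (d, d) summary of all tokens
    k_sum = k.sum(axis=0)                                   # (d,) normaliser
    return (q @ kv) / ((q @ k_sum)[:, None] + eps)          # (N, d)


# Toy input: 64 tokens with 8 features each (sizes chosen arbitrarily).
rng = np.random.default_rng(0)
q, k, v = rng.standard_normal((3, 64, 8))

baseline = softmax_attention(q, k, v)
with_map = relu_attention_with_map(q, k, v)
reordered = relu_linear_attention(q, k, v)
print("output shapes:", baseline.shape, reordered.shape)
print("max difference between the two ReLU variants:", np.abs(with_map - reordered).max())
```

The two ReLU variants agree up to floating-point rounding, because the reordering relies only on the associativity of matrix multiplication. The softmax version cannot be rearranged this way, since the exponential has to be applied to each entry of the N x N map.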