MIT researchers have developed a streamlined method for spatiotemporal grounding, leveraging videos and automated transcripts for enhanced efficiency.
The internet offers instructional videos on countless tasks, but finding a specific action in a long video takes a lot of work. Scientists aim to teach AI to locate described actions automatically, though training such models usually requires costly, hand-labeled data.
Researchers at MIT and the MIT-IBM Watson AI Lab have developed an efficient spatiotemporal grounding approach that learns from videos and their automatic transcripts. Their model analyzes both small details and overall sequences, allowing it to accurately identify actions in longer videos that contain multiple activities, and training on spatial and temporal information simultaneously improves performance. The technique could enhance online learning and virtual training, and could aid health care by quickly identifying key moments in diagnostic videos.
Global and local learning
Researchers typically teach models to perform spatiotemporal grounding using annotated videos, but creating that annotated data is expensive, and labeling it precisely is difficult. Instead, these researchers use unlabeled instructional videos and their automatic text transcripts from sources like YouTube, which require no special preparation. They split the training into two parts: teaching the model to understand the overall timing of actions across a video, and teaching it to focus on the specific regions where those actions occur.
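To make the idea concrete, here is a minimal sketch, not the authors' actual training code, of how a global temporal objective and a local spatial objective might be combined. The feature shapes, the contrastive formulation, the entropy penalty, and the loss weight are all illustrative assumptions.

```python
# A hypothetical two-part training objective: a "global" term that matches narration
# sentences to the right moments in the video, plus a "local" term that encourages
# spatial attention to concentrate on a few regions per frame.
import torch
import torch.nn.functional as F

def global_temporal_loss(frame_feats, text_feats, temperature=0.07):
    """Contrastive (InfoNCE-style) loss matching each narration sentence to a video segment.

    frame_feats: (T, D) one feature vector per video segment
    text_feats:  (T, D) the narration sentence assumed to describe that segment
    (Assumes sentence i corresponds to segment i, a simplification for the sketch.)
    """
    frame_feats = F.normalize(frame_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = text_feats @ frame_feats.t() / temperature   # (T, T) similarity matrix
    targets = torch.arange(len(text_feats))               # diagonal entries are the matches
    return F.cross_entropy(logits, targets)

def local_spatial_loss(attention_maps):
    """Entropy penalty pushing spatial attention toward a few focused regions.

    attention_maps: (T, H*W) attention over spatial positions; each row sums to 1.
    """
    entropy = -(attention_maps * (attention_maps + 1e-8).log()).sum(dim=-1)
    return entropy.mean()

# Toy example: 8 video segments, 512-dim features, a 7x7 spatial attention grid.
T, D, HW = 8, 512, 49
frames = torch.randn(T, D)
sentences = torch.randn(T, D)
attn = torch.softmax(torch.randn(T, HW), dim=-1)

loss = global_temporal_loss(frames, sentences) + 0.5 * local_spatial_loss(attn)
print(f"combined training loss: {loss.item():.3f}")
```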
An additional component addresses misalignments between the narration and the video. And unlike most AI techniques, which train on short, trimmed clips, their approach uses uncut videos that are several minutes long, a more realistic setting.
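One simple way such a component could work, shown purely as an illustration and not as the paper's method, is to down-weight narration sentences whose best visual match in the video is weak, so poorly aligned narration contributes less to the training signal.

```python
# Hypothetical sketch: weight each narration sentence by how well it matches any frame,
# so sentences describing things that never appear on screen are largely ignored.
import torch
import torch.nn.functional as F

def alignment_weights(frame_feats, text_feats, temperature=0.1):
    """Return a weight in (0, 1) per narration sentence based on its best frame match."""
    frame_feats = F.normalize(frame_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    sims = text_feats @ frame_feats.t()        # (num_sentences, num_frames)
    best = sims.max(dim=-1).values             # strongest visual match per sentence
    return torch.sigmoid(best / temperature)   # weak matches -> small weights

frames = torch.randn(20, 512)    # features for an uncut, several-minute video
sentences = torch.randn(6, 512)  # features for the transcript sentences
w = alignment_weights(frames, sentences)
print("per-sentence weights:", [round(x, 2) for x in w.tolist()])
```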
A new benchmark
When evaluating their approach, the researchers found no effective benchmark for testing models on longer, uncut videos, so they created one. They developed a new annotation technique for identifying multi-step actions: rather than drawing a box around important objects, annotators mark the point where objects interact, such as where a knife edge cuts a tomato. Having multiple people mark points on the same video also captures actions that unfold over time, such as the flow of milk being poured. On this benchmark, their approach pinpointed actions more accurately and focused more closely on human-object interactions than other AI techniques.
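As an illustration of how a point-based benchmark like this could be scored (the radius, coordinate convention, and data format below are assumptions, not the benchmark's actual protocol), a prediction might count as correct when it lands close to any human-annotated interaction point.

```python
# Hypothetical "pointing" metric: a predicted interaction point is a hit if it falls
# within a small radius of at least one annotator-marked point on that frame.
import math

def point_accuracy(predictions, annotations, radius=0.05):
    """predictions: {frame_id: (x, y)} model's predicted point, in normalized coordinates
    annotations: {frame_id: [(x, y), ...]} points marked by human annotators
    Returns the fraction of annotated frames where the prediction is within `radius`
    of at least one annotated point."""
    hits = 0
    for frame_id, points in annotations.items():
        px, py = predictions.get(frame_id, (None, None))
        if px is None:
            continue
        if any(math.dist((px, py), (ax, ay)) <= radius for ax, ay in points):
            hits += 1
    return hits / max(len(annotations), 1)

# Toy example: two annotated frames; the model hits the first and misses the second.
preds = {0: (0.52, 0.40), 1: (0.10, 0.90)}
annos = {0: [(0.50, 0.41), (0.55, 0.38)], 1: [(0.70, 0.20)]}
print(f"point accuracy: {point_accuracy(preds, annos):.2f}")  # -> 0.50
```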
The researchers plan to enhance their approach so models can automatically detect misalignment between the narration and the video and switch focus from one modality to the other, and to extend their framework to include audio data, given the strong correlations between actions and sounds.