Researchers envision enabling data centers to match a user’s job request with the hardware best suited to it, potentially conserving energy and improving system performance.
When the MIT Lincoln Laboratory Supercomputing Center (LLSC) introduced its TX-GAIA supercomputer in 2019, it provided the MIT community with a powerful new resource for applying artificial intelligence to their research. The researchers aim to help computer scientists and data center operators better understand approaches to data center optimization. They also aim to use AI in the data center to develop models that predict failure points, optimize job scheduling, and improve energy efficiency. The Datacenter Challenge invites researchers to develop AI techniques that identify, with 95 percent accuracy, the type of job that was run.
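To make the task concrete, a job-type classifier of this kind could be trained on per-job telemetry such as CPU, GPU, and memory utilization. The sketch below is a minimal illustration, not the challenge's actual pipeline: the feature set, the job-type labels, and the synthetic data are all assumptions, and with random labels the reported score is meaningless; real monitoring logs would be needed to approach the 95 percent target.

```python
# Minimal sketch: classify the type of job from monitoring telemetry.
# Feature names and data are illustrative; real entries would come from
# the data center's scheduler and node-level monitoring logs.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n_jobs = 1000

# Hypothetical per-job features: mean CPU %, mean GPU %, peak memory (GB), runtime (s)
X = np.column_stack([
    rng.uniform(0, 100, n_jobs),    # mean CPU utilization
    rng.uniform(0, 100, n_jobs),    # mean GPU utilization
    rng.uniform(1, 512, n_jobs),    # peak memory
    rng.uniform(60, 86400, n_jobs), # wall-clock runtime
])
# Hypothetical job-type labels, e.g. 0 = ML training, 1 = simulation, 2 = data analysis
y = rng.integers(0, 3, n_jobs)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
print(f"held-out accuracy: {accuracy_score(y_test, clf.predict(X_test)):.2f}")
```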
“We are fixing this very capability gap—making users more productive and helping users do science better and faster without worrying about managing heterogeneous hardware,” says Tiwari. “My Ph.D. student, Baolin Li, is building new capabilities and tools to help HPC users leverage heterogeneity near-optimally without user intervention, using techniques grounded in Bayesian optimization and other learning-based optimization methods. But, this is just the beginning. We are looking into ways to introduce heterogeneity in our data centers in a principled approach to help our users achieve the maximum advantage of heterogeneity autonomously and cost-effectively.”
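For readers unfamiliar with the approach Tiwari describes, the sketch below shows in broad strokes how a Bayesian-optimization loop can search over hardware choices for a job. It is an assumption-laden illustration rather than the LLSC's actual tooling: the hardware options and the `measure_runtime` objective are hypothetical stand-ins for launching or modeling a real job.

```python
# Minimal sketch of Bayesian optimization over hardware configurations,
# using scikit-optimize. The objective is a stand-in; a real system would
# run (or model) the job and measure runtime or energy on each configuration.
from skopt import gp_minimize
from skopt.space import Categorical, Integer

# Hypothetical hardware knobs: GPU generation and node count.
space = [
    Categorical(["v100", "a100", "cpu-only"], name="device"),
    Integer(1, 16, name="nodes"),
]

def measure_runtime(params):
    """Stand-in objective: pretend cost of running a job on this configuration."""
    device, nodes = params
    base = {"v100": 100.0, "a100": 60.0, "cpu-only": 400.0}[device]
    # Toy model: imperfect scaling plus a per-node overhead.
    return base / nodes + 2.0 * nodes

result = gp_minimize(measure_runtime, space, n_calls=25, random_state=0)
print("best configuration:", result.x, "estimated cost:", round(result.fun, 1))
```

The appeal of this style of search is that each evaluation is expensive (it means running or simulating a job), so a surrogate model that proposes promising configurations after only a handful of trials can save substantial time and energy compared with exhaustive sweeps.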
Currently, the LLSC offers tools that let users submit jobs and select the processors they want, but this can lead users to choose the latest GPU even when their computation does not require it. The research would instead enable the data center to select hardware matched to the user’s job request, improving system performance. Differentiating workloads would also allow operators to respond quickly to discrepancies caused by hardware failures, inefficient data access patterns, or unauthorized usage.
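A scheduler-side matcher of that kind could be as simple as the sketch below: pick the least capable partition that still satisfies a job's predicted needs, so the newest GPUs stay available for work that actually requires them. The partition table, cost proxy, and job fields are assumptions made purely for illustration, not the LLSC's scheduler interface.

```python
# Minimal sketch: match a job request to the least capable hardware that meets it,
# rather than defaulting to the newest GPU. Partition specs are illustrative.
from dataclasses import dataclass

@dataclass
class Partition:
    name: str
    gpu_memory_gb: int    # 0 means CPU-only
    relative_cost: float  # proxy for energy use / scarcity of the hardware

PARTITIONS = [
    Partition("cpu-only", 0, 1.0),
    Partition("older-gpu", 16, 2.0),
    Partition("latest-gpu", 80, 5.0),
]

def match_partition(needs_gpu: bool, gpu_memory_gb: int) -> Partition:
    """Return the cheapest partition that satisfies the job's predicted requirements."""
    candidates = [
        p for p in PARTITIONS
        if (not needs_gpu and p.gpu_memory_gb == 0)
        or (needs_gpu and p.gpu_memory_gb >= gpu_memory_gb)
    ]
    return min(candidates, key=lambda p: p.relative_cost)

# A small training job fits on the older GPUs, leaving the latest ones free.
print(match_partition(needs_gpu=True, gpu_memory_gb=12).name)   # -> older-gpu
print(match_partition(needs_gpu=False, gpu_memory_gb=0).name)   # -> cpu-only
```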
“Data centers are changing. We have an explosion of hardware platforms, the types of workloads are evolving, and the types of people using data centers are changing,” says Vijay Gadepally, a senior researcher at LLSC. “Until now, there hasn’t been a great way to analyze the impact on data centers. We see this research and dataset as a big step toward coming up with an initial approach to understanding how these variables interact with each other and then applying AI to gain insights and improvements.”
“We hope this research will allow us and others who run supercomputing centers to be more responsive to user needs while also reducing the energy consumption at the center level,” Samsi says.