The Technology Innovation Institute (TII) has announced the debut of NOOR, the world’s largest Arabic natural language processing (NLP) model to date.
The NOOR project goes beyond language modelling itself, covering crawling, filtering, and curation at scale to provide an end-to-end pipeline for high-quality data. It also supports large-scale distributed training and serving, so that applications can be delivered with efficient inference and model specialisation. To create it, TII’s team of researchers and Artificial Intelligence (AI) experts teamed up with LightOn, a technology startup that enables extreme-scale machine intelligence for enterprises.
Dr Ebtesam Almazrouei, Director, AI Cross-Center Unit at TII, said, “Large language models have taken the world of natural language processing by storm, and we are proud to introduce this cutting-edge model with 10 billion parameters – the world’s largest Arabic NLP model.”
The NOOR model, according to Dr Almazrouei, is based on the popular Transformer architecture. It is a decoder-only model designed for generative tasks, similar in structure to GPT-3, with the architecture updated to reflect recent advances in machine learning, such as improved positional embeddings. To ensure quality at scale in the NOOR dataset, the TII team built an automated filtering pipeline based on machine learning approaches. These techniques identify high-quality content, such as reputable references, and keep spam out of the training data.
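For readers unfamiliar with the term, "decoder-only" means that each token may attend only to itself and to the tokens before it, which is what makes left-to-right text generation possible. The short Python sketch below illustrates that idea with a causal mask applied to one row of attention scores. It is a generic illustration, not NOOR's actual code; the function names `causal_mask` and `masked_softmax` are hypothetical and chosen only for this example.

```python
import math


def causal_mask(seq_len):
    """Build a seq_len x seq_len mask: position i may see positions 0..i only."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def masked_softmax(scores, allowed):
    """Softmax over one row of attention scores, ignoring masked positions.

    Masked positions receive a weight of exactly 0.0. A causal row always
    allows at least one position (the token itself), so `visible` is non-empty.
    """
    visible = [s for s, ok in zip(scores, allowed) if ok]
    peak = max(visible)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    mask = causal_mask(4)
    # The second token (index 1) can attend to tokens 0 and 1, not 2 or 3.
    weights = masked_softmax([0.5, 1.5, 2.0, 0.1], mask[1])
    print(weights)
```

In a full Transformer this masking is applied to every row of every attention head; the sketch keeps to a single row so the effect is easy to see.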
NOOR was trained on a High-Performance Computing resource with 128 A100 GPUs using state-of-the-art 3D parallelism, which distributes computation across the GPUs and makes efficient use of the available hardware.
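"3D parallelism" conventionally refers to combining data, tensor, and pipeline parallelism, so that the product of the three degrees equals the number of GPUs. The announcement does not say how the 128 GPUs were divided, so the Python sketch below simply enumerates the possible splits and shows one way a GPU rank could map to a position in such a grid. The function names, the rank ordering, and the 4 x 8 x 4 layout in the usage lines are assumptions for illustration, not details of TII's setup.

```python
WORLD_SIZE = 128  # number of A100 GPUs reported in the article


def parallel_layouts(world_size):
    """Return every (data, tensor, pipeline) triple whose product is world_size."""
    divisors = [d for d in range(1, world_size + 1) if world_size % d == 0]
    layouts = []
    for dp in divisors:
        for tp in divisors:
            if world_size % (dp * tp) == 0:
                layouts.append((dp, tp, world_size // (dp * tp)))
    return layouts


def rank_to_coords(rank, tp, pp):
    """Map a flat GPU rank to (data, pipeline, tensor) coordinates.

    Assumed ordering: the tensor index varies fastest, then pipeline, then data.
    """
    tensor_idx = rank % tp
    pipeline_idx = (rank // tp) % pp
    data_idx = rank // (tp * pp)
    return data_idx, pipeline_idx, tensor_idx


if __name__ == "__main__":
    layouts = parallel_layouts(WORLD_SIZE)
    print(len(layouts), "possible (data, tensor, pipeline) splits")
    # Hypothetical layout: 4-way data, 8-way tensor, 4-way pipeline parallelism.
    print(rank_to_coords(37, tp=8, pp=4))
```

Which split works best in practice depends on model size, GPU memory, and interconnect bandwidth, none of which the announcement specifies.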
NOOR’s unique dataset of more than 30 billion words mixes web data with literature, poetry, news stories, and technical knowledge to considerably broaden the applicability of the model. It is the world’s largest high-quality cross-domain Arabic dataset.
Speaking on the launch, Professor Mérouane Debbah, Chief Researcher, Digital Science Research Center and AI Cross-Center Unit at TII, said, “With NOOR, TII has expanded the scope of the modern standard Arabic model by leveraging know-how in large language models to build cross-disciplinary, cutting-edge expertise in this new generation of AI research.”