– Cohere for AI has unveiled Aya, an open-source large language model (LLM) supporting 101 languages, more than twice the number covered by existing open-source models.
– The Aya project involved over 3000 collaborators from 119 countries.
– The Aya dataset, a collection of human annotations, was also released, and the lab’s engineers found ways to improve model performance with less training data.
– The Aya model surpasses the best open-source models in performance on benchmark tests and expands coverage to previously unserved languages.
– Aya’s data is considered rare and can be used to create models for subsets of languages.
– Building high-quality data sources for non-English languages is crucial for supporting LLMs.
– A global research community and support from governments are necessary for preserving languages and cultures in the AI world.
– The Aya model and datasets are available on Hugging Face.
Today, Cohere for AI, the nonprofit research lab established by Cohere in 2022, unveiled Aya, an open-source large language model (LLM) supporting 101 languages — more than twice the number of languages covered by existing open-source models.
The researchers also released the Aya dataset, a corresponding collection of human annotations. This is key because one obstacle to training models on less common languages is that there is less source material to train on. But according to Cohere for AI, the lab’s engineers also found ways to improve model performance with less training data.
The Aya project, which was launched in January 2023, was a “huge endeavor” with over 3000 collaborators around the world, including teams and participants from 119 countries, said Sara Hooker, VP of research at Cohere and leader of Cohere for AI.
The project comprises over 513 million instruction fine-tuned annotations (data labels to help classify information). “I don’t think we knew at the time quite how enormous it would be as a project,” Hooker told VentureBeat in an interview, calling this kind of data the highly valuable “gold dust” that goes in at the end of LLM training (as opposed to pre-training data scraped from the internet).
Ivan Zhang, co-founder and CTO of Cohere, posted on X that “we’re releasing human demonstrations across 100+ languages to further scale intelligence and ensure that it serves more of humanity than just the english literate world,” calling it “yet another impossible scientific and operational feat achieved by” Hooker and the Cohere for AI team.
Potential of LLMs for languages and cultures largely ignored
According to a Cohere blog post, the new model and dataset are meant to help “researchers unlock the powerful potential of LLMs for dozens of languages and cultures largely ignored by most advanced models on the market today.”
Cohere for AI said that it benchmarked the Aya model’s performance against available open-source, massively multilingual models. It surpasses the best open-source models, such as mT0 and Bloomz, in performance on benchmark tests “by a wide margin,” and expands coverage to more than 50 previously unserved languages, including Somali and Uzbek.
Hooker pointed out that any model covering more than six languages is typically considered “extreme” in terms of multilingual performance, and that once there are about 25 languages, “that’s ‘massively multilingual.’ There are only a few models that actually tackle that many languages and report performance on them.”
A data ‘cliff’ outside of English
That means that there is a data “cliff” of sorts outside of English fine-tuning data, Hooker explained, so Aya’s data is “incredibly rare.”
“What I expect will happen is that people will select languages that they want to share from this dataset, and they will be able to iterate and create models which serve subsets of languages, and that’s a huge need,” she said. “But the biggest divide I see right now technically is personalization. These models have been used all over the world and so people want it to work for them. And they want to personalize, and part of that just requires data in different languages.”
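To make the idea of serving “subsets of languages” concrete, here is a minimal sketch of pulling a few languages out of a multilingual annotation set before fine-tuning. It is an illustration only: the records, the field names (`language`, `inputs`, `targets`) and the helper `select_languages` are assumptions made for this example, not the Aya dataset’s actual schema or any Cohere tooling.

```python
from collections import Counter

# Hypothetical annotation records, invented for this sketch.
# The real Aya dataset's schema may differ.
records = [
    {"language": "Somali", "inputs": "<prompt 1>", "targets": "<answer 1>"},
    {"language": "Somali", "inputs": "<prompt 2>", "targets": "<answer 2>"},
    {"language": "Uzbek", "inputs": "<prompt 3>", "targets": "<answer 3>"},
    {"language": "Serbian", "inputs": "<prompt 4>", "targets": "<answer 4>"},
    {"language": "English", "inputs": "<prompt 5>", "targets": "<answer 5>"},
]


def select_languages(rows, wanted):
    """Return only the rows whose language is in `wanted` (case-insensitive)."""
    wanted_set = {name.casefold() for name in wanted}
    return [row for row in rows if row["language"].casefold() in wanted_set]


# Keep only the languages a hypothetical regional model should serve.
subset = select_languages(records, ["somali", "Uzbek"])

# Count examples per language to see how much data each one contributes.
counts = Counter(row["language"] for row in subset)
print(counts)
```

The per-language counts are the practical point: they show how thin the data for a given language is before anyone commits to training a model on that subset.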
Aleksa Gordic, previously a researcher at Google DeepMind, is currently building a full-stack generative AI platform for language-specific LLMs and developed YugoGPT, an LLM that he says outperformed Mistral and Llama 2 for Serbian, Bosnian, Croatian, and Montenegrin.
“I definitely think that Aya and all similar multilingual data efforts are crucial,” he told VentureBeat. “LLMs feed on data and if you want to support non-English languages you need high quality and ideally abundant data sources for that target language of interest so you can build high quality LLMs.”
The effort is “definitely not enough,” he added, but “is a step in the right direction.” A global research community is needed to work on this, he explained, “and we also need support from governments around the world to understand the importance of building large and high quality data sources. That way you preserve your language, your culture in the brand new AI world.”
Cohere for AI’s Aya model and datasets are already available on Hugging Face.
AI Eclipse TLDR:
Cohere for AI, the nonprofit research lab established by Cohere, has unveiled Aya, an open-source large language model (LLM) that supports 101 languages, more than twice the number covered by existing open-source models. In addition to the language model, the researchers have also released the Aya dataset, which consists of human annotations. This is crucial because training models on less common languages is challenging due to the limited source material available. The Aya project, launched in January 2023, involved over 3,000 collaborators from 119 countries. With more than 513 million instruction fine-tuned annotations, the project turned out to be larger than its organizers anticipated. The Aya model outperforms other open-source models on benchmark tests and expands coverage to over 50 previously unserved languages. The release of Aya is aimed at unlocking the potential of LLMs for languages and cultures that have been largely ignored by existing models. However, there is still a data “cliff” outside of English fine-tuning data, making Aya’s data incredibly rare. The Aya model and datasets are already available on the Hugging Face platform.