
Studies explore challenges of AI for low-resource languages

Many languages still lack sufficient data to train AI, studies find.

[Image: man using a laptop with icons for various languages overlaid. Credit: Supatman/Getty Images]



Generative AI has made big strides in the past couple of years, but much of this progress remains concentrated in the English language.

A set of recent research papers examined the roadblocks that stand in the way of developers aiming to close that gap, especially when it comes to languages with less available text data. One big obstacle is the lack of comprehensive benchmarks (the measures developers use to grade AI capabilities) that adequately capture the nuances of what researchers call low-resource languages.

The lack of large language models (LLMs) for languages with less online data threatens to widen existing global divides, cutting off parts of the Global South from potentially transformative technology, researchers at the Stanford Institute for Human-Centered AI wrote in a recent white paper.

“Most major LLMs underperform for non-English—and especially low-resource—languages; are not attuned to relevant cultural contexts; and are not accessible in parts of the Global South,” the authors wrote.

Meanwhile, researchers from Alibaba and a team from Google and Cohere have each analyzed the shortcomings of non-English benchmarks when it comes to evaluating multilingual models. And a new dataset from prominent AI ethics researchers aims to help root out biases and stereotypes across 16 different languages.

Data dearth: The authors of the Stanford paper wrote that low-resource languages, which include Burmese and Swahili, lack both the quantity and the quality of digital data needed to train LLMs. Data in some of these languages spans only the Bible or other religious texts, legal documents, and Wikipedia articles, which may themselves be machine-translated and are not representative of how people speak day to day, the authors wrote.

“Less than 5% of the roughly 7,000 languages spoken around the world today have meaningful online representation,” the researchers wrote.

Benchmark blues: Alibaba researchers also surveyed more than 2,000 of the multilingual benchmarks used to grade LLMs in non-English languages and learned “a bitter lesson,” as they put it. That is, “despite significant investments amounting to tens of millions of dollars, English remains significantly overrepresented in these benchmarks.”

Simply translating existing benchmarks into other languages doesn’t work either, because the translations tend to lose key cultural context. The recent paper from Google and Cohere backs up this conclusion.

“Evaluation practices for generative abilities of [multilingual LLMs] are still lacking comprehensiveness, scientific rigor, and consistent adoption across research labs, which undermines their potential to meaningfully guide mLLM development,” that paper reads.

A new dataset called Shades, built by a global team of AI ethicists and researchers, aims to help developers address some of these cultural context issues by identifying stereotypes and biases across 16 languages. Many other bias mitigation efforts have concentrated on English-language contexts, the team writes in an accompanying paper.

Possible solutions: The authors of the Stanford paper also evaluated the pros and cons of three of the main approaches to solving this dilemma: massive multilingual models that aim to cover a huge swath of languages, regionally specific models that might target 10–20 low-resource languages, and single-language models.

There tends to be a trade-off here between cultural specificity and data availability: massive multilingual models underperform and lack cultural context in any one language (the “curse of multilinguality”), while monolingual models can come up short on training data.

Stanford’s researchers ultimately recommend fixes like strategic investments in R&D for low-resource language AI, more global inclusivity in AI research, and more equitable data ownership.

“‘Low resourcedness’ is not solely a data problem but a phenomenon rooted in societal problems such as non-diverse, exclusionary, and even exploitative AI research practices,” the authors wrote.
