“This is really exciting. This is an amazing view—this is awesome!”
When the AI model spat out quotes like that, Margaret Mitchell couldn’t believe her eyes. The happy-go-lucky text was generated in response to images of the 2005 Buncefield oil storage depot explosion in England, a blast so big it registered on the Richter scale.
It was 2015, and Mitchell was working at Microsoft Research, feeding a series of images to an AI model and having it describe what it saw. The vision-to-language system used an earlier version of what we now call large language models.
The model had been trained mostly on upbeat images, like parties, weddings, and other celebrations—things people tend to photograph more often than violent or life-threatening events. Mitchell told Emerging Tech Brew this was her “aha” moment, when she realized firsthand the extent to which skewed datasets can generate problematic outputs—and dedicated her career to researching the problem.
That moment led Mitchell to join Google and, in 2017, found its Ethical AI team. But in February, Mitchell was fired from Google, just three months after her Ethical AI co-lead, Timnit Gebru, was also terminated.
While the exact reason for the firings is complex and disputed, a paper about the real-world dangers of large language models—co-authored by Gebru and Mitchell—set off the chain of events leading to the terminations. Since then, a debate over the technology—one of the biggest AI advances in recent years—has gripped the AI world.
Large language models are powerful machine learning algorithms with one key job description: identifying large-scale patterns in text. The models use those patterns to “parrot” human-like language. And they quietly underpin services like Google Search—used by billions of people worldwide—and predictive text software, such as Grammarly.
Proponents of the models say their ability to analyze reams of text has led to countless breakthroughs, from simplifying and summarizing complex legal jargon to translating plain language into computer code. But an increasing number of researchers, ethicists, and linguists say the tools are overused and under-vetted—that their creators haven’t fully reckoned with the sweeping biases and downstream harms they could be perpetuating.
One example of those harms: When researchers asked OpenAI’s GPT-3 model to complete a sentence containing the word “Muslims,” it turned to violent language in more than 60% of cases—introducing words like bomb, murder, assault, and terrorism.
GPT-3 is one of the largest and most sophisticated of these models ever created; it is generally considered to be one of the biggest AI breakthroughs in the past decade. For its part, an OpenAI spokesperson told us harmful bias is “an active and ongoing area of research and safety work” for the company.
Left unchallenged, these models are effectively a mirror of the internet: the good, the mundane, and the disturbing. With this potential to spread stereotypes, hate speech, misinformation, and toxic narratives, some experts say the tools are too powerful to be left unchecked.
It wasn’t always like this.
The modern form of “neural networks,” the AI infrastructure that powers large language models, has been around since Guns N’ Roses topped the charts. But it took three decades for hardware and available data to catch up—which means that more advanced applications for neural networks, like large language models, are a recent development.
In the past four years, we’ve seen a leap forward for large language models. They graduated from basic speech recognition—identifying individual words and the words they often appear with (e.g., “cheese” and “pizza”)—to being able to categorize a word’s context when given a list of options (“cheese” in the context of pizza, or alternatively as a photographer’s plea). Using all these observations, a model can also infer when a word is missing (“chili __ fries”), and analyze how a sentence’s meaning differs based on the order of its words.
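The fill-in-the-blank trick above can be illustrated with a deliberately tiny sketch. This is not how a real large language model works (those learn statistical representations across billions of examples), but a toy trigram counter shows the core idea of recognizing which words co-occur; the corpus here is a made-up stand-in for web-scale text.

```python
from collections import Counter

# Toy corpus standing in for web-scale training text (hypothetical example).
corpus = (
    "chili cheese fries . cheese pizza . say cheese . "
    "chili cheese fries . pepperoni pizza ."
).split()

# Count every three-word sequence that appears in the corpus.
trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))

def fill_blank(left, right):
    """Guess the missing word in 'left __ right' from trigram counts."""
    candidates = {mid: n for (l, mid, r), n in trigrams.items()
                  if l == left and r == right}
    return max(candidates, key=candidates.get) if candidates else None

print(fill_blank("chili", "fries"))  # -> cheese
```

Scale that counting up by many orders of magnitude, swap the counts for learned neural representations, and you have the rough intuition behind “pattern recognition at scale.”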
Large language models do all of this—finish sentences, summarize legal documents, generate creative writing—at a staggering scale, “but to varying levels of quality and with no explainability,” Gadi Singer, VP and director of cognitive computing research at Intel Labs, told us. In the case of Gmail’s email autocomplete, that scale is nearly two billion users.
Models this big require an unthinkable amount of data; the entirety of English-language Wikipedia makes up just 0.6% of GPT-3’s training data.
So in the 2010s, when the vision of ever-larger models came into focus, we also saw more large companies offering services for “free” and soliciting user data in return, says Emily M. Bender, professor of linguistics at the University of Washington, who co-authored the December paper with Gebru and Mitchell.
“That comes together with the compute power catching up, and then all of a sudden it’s possible to make this leap, in terms of doing this kind of processing of a very large collection of data,” Bender told us, referencing a talk by Meredith Whittaker, co-founder and director of the AI Now Institute.
From there, language models expanded past speech recognition and predictive text to being critical in virtually all natural language processing tasks. Then, in 2018, Google created BERT, the model that now underpins Google Search. It was a particularly big breakthrough, says Singer, because it was the first transformer-based language model. That means it can process sequential data in any order—it doesn’t need to start at the beginning of a sentence to understand its meaning.
Other transformer-based megamodels began to emerge shortly thereafter, like Nvidia’s Megatron, Microsoft’s T-NLG, and OpenAI’s GPT-2 and GPT-3—primed for use not just in traditional natural language processing, but far and wide. If you've noticed your predictive text reading your mind more often, transformer-based models are why.
Tens of thousands of developers around the world build on GPT-3, fueling startups and applications like OthersideAI’s auto-generated emails, AI Dungeon’s text-based adventure game, and Latitude’s entertainment software. It’s “lowering the barrier for what an AI engineer might look like and who that person is,” OpenAI policy researcher Sandhini Agarwal told us.
A little more than a year after BERT’s release, and just six months after GPT-3’s, the news broke that Google had fired Gebru—and the previously niche debate over the ethics of these models exploded to the fore of the AI world and into the mainstream, too. Search engine traffic for “large language models” went from zero to 100. People texted their friends in the tech world to ask what the term meant, and the models made headlines in most three-letter news outlets.
Big models, bigger risks
Large language models eat up virtually everything on the internet, from dense product manuals to the “Am I The Asshole?” subreddit. And every year, they ingest more data and grow larger, as measured in parameters, the internal values a model tunes during training. BERT’s base version has about 110 million parameters; its larger version ballooned to 340 million. OpenAI’s GPT-3 weighs in at a hulking 175 billion. And in January, Google announced a new language model that’s almost six times bigger: one trillion parameters.
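To get a feel for what those parameter counts mean in hardware terms, here is a rough back-of-the-envelope calculation. It assumes 2 bytes per parameter (as with 16-bit floating-point weights); real deployments vary, so treat the numbers as order-of-magnitude only.

```python
# Rough memory footprint of model weights alone,
# assuming 2 bytes (16 bits) per parameter.
BYTES_PER_PARAM = 2

models = [
    ("BERT-base", 110e6),
    ("BERT-large", 340e6),
    ("GPT-3", 175e9),
    ("1-trillion-param model", 1e12),
]

for name, params in models:
    gigabytes = params * BYTES_PER_PARAM / 1e9
    print(f"{name}: ~{gigabytes:,.1f} GB of weights")
```

By this estimate GPT-3’s weights alone occupy on the order of 350 GB, far more than any single consumer GPU can hold, which is part of why these models stay in large data centers.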
Plenty of this data is benign. But because large language models often scrape data from most of the internet, racism, sexism, homophobia, and other toxic content inevitably filter in. This runs the gamut from explicitly racist language, like a white supremacist comment thread, to more subtle, but still harmful connections, like associating powerful and influential careers with men rather than women. Some datasets are more biased than others, but none is entirely free of it.
Since large language models are ultimately “pattern recognition at scale,” says Bender, the inevitability of toxic or biased datasets raises an important question: What kinds of patterns are they recognizing—and amplifying?
Google did not respond to a request for comment on these concerns and how the company is addressing them.
OpenAI admits this is a big issue. Technical Director for the Office of the CTO Ashley Pilipiszyn told us that biased training data, taken from the internet and reflected in the model, is one of her concerns with GPT-3. And although many of GPT-3’s users input relatively clean prompts—e.g., “Translate this NDA into a simple paragraph”—if someone does prompt GPT-3 to issue toxic statements, then the model will do just that.
That “lack of controllability,” in terms of those who build and use the model, is a big concern for Agarwal. While GPT-3 is increasing tech accessibility, “it’s also lowering a barrier for many potentially problematic use cases,” she told us. “Once the barrier to create AI tools and generate text is lower, people could just use it to create misinformation at scale, and having that data coupled with certain other platforms can just be a very disastrous situation.”
The company regularly publishes research on risks and does have safeguards in place—limited access, a content filter, some general monitoring tools—but there are limits. There’s a team of humans reviewing applications to increase GPT-3 usage quota, but it’s just six people. And although OpenAI is building out a tougher use-case monitoring pipeline in the event that people lie on those applications, says Agarwal, it hasn’t yet been released.
Then there are the environmental costs. The compute—i.e., pooled processing power—required to train one of these models spikes by a factor of ~10 every year. Training one large language model can generate more than 626,000 pounds of CO2 equivalent, according to one paper. That's more than the average human’s climate change impact over a period of 78 years.
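The arithmetic behind that comparison can be checked by working backward from the two figures in the text: roughly 626,000 pounds of CO2 equivalent per training run, set against 78 years of an average person’s emissions.

```python
# Figures from the text: ~626,000 lbs CO2e for one training run,
# equated to 78 years of an average human's emissions.
LBS_PER_METRIC_TON = 2204.6

training_lbs = 626_000
training_tons = training_lbs / LBS_PER_METRIC_TON
implied_per_person_per_year = training_tons / 78  # metric tons CO2e/year

print(f"One training run: ~{training_tons:.0f} metric tons CO2e")
print(f"Implied average footprint: ~{implied_per_person_per_year:.1f} t/person/year")
```

That works out to roughly 284 metric tons per training run, which implies an average individual footprint of about 3.6 tons per year, in the ballpark of global per-capita estimates.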
“This rate of growth is not sustainable,” Singer said, citing escalating compute needs, costs, and environmental effects. “You can’t grow a model every year, 10x continuously, for the next 10 years.”
Finally, there's the fact that it's virtually impossible to build models of this size for the “vast majority of the world’s languages,” says Bender—simply because there’s much less training data available for anything besides American English. As a result, the advancing technology naturally excludes those who want to harness its powers in other languages.
“In Denmark, we’re really struggling to get enough data to build just a simple BERT,” says Leon Derczynski, a Copenhagen-based machine learning and language scientist. “As a volunteer project...it’s really, really tough.” He says that although people don’t think they should stop working toward a Danish BERT, at the same time, “the local attitude is almost kind of, ‘Why don't we just...make the people use English?’ So that sucks.”
From a business perspective, that means entrepreneurs in non-English-speaking countries have few options when it comes to using existing large language models: build in English, build a big, expensive local corpus, or don’t use them at all. Bender also worries about the social costs of this dynamic, which she thinks will further entrench English as the global norm.
Ask any AI leader which type of model or solution is best moving forward, and they’ll ask you something right back: Best for what?
“What’s the problem we’re trying to solve? And how is it going to be deployed in the world?” says Bender. And when it comes to environmental costs, she adds, it’s not just a question of whether to use a large or small language model for a certain task: “There’s also the question of, Do we need this at all? What’s the actual value proposition of the technology? And…Who is paying the environmental price for us doing this, and is this fair?”
Experts we spoke with agreed that the process typically should start with questions like these. Many tasks don’t need a language model at all, and if they do, they may not need a large one—especially considering the societal and environmental costs.
Though the industry is trending toward ever-larger language models with more and more data, there’s also a smaller-scale movement to reduce the size of individual models and their architecture. Think of it as tidying up after the fact, Marie Kondo-style, and throwing out unused, inefficient, or low-value content.
“By using compression, or pruning, and other methods, you can make a smaller model without losing much of the information, especially if you know what you’re going to use it for,” says Singer. One startup, Hugging Face, is attempting to distill BERT down to 60% of its base size while preserving 95% of its performance.
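These are not the companies’ actual pipelines, but one of the methods Singer names, pruning, can be sketched in a few lines: drop the weights whose magnitudes are smallest, on the theory that they contribute least to the model’s output. The toy “layer” below is a hypothetical example.

```python
def magnitude_prune(weights, sparsity=0.5):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    k = int(len(weights) * sparsity)  # how many weights to drop
    if k == 0:
        return list(weights)
    # The k-th smallest magnitude becomes the cutoff.
    threshold = sorted(abs(w) for w in weights)[k - 1]
    return [0.0 if abs(w) <= threshold else w for w in weights]

# Toy layer: half the weights are near zero and contribute little.
layer = [0.91, -0.02, 0.33, 0.01, -0.78, 0.04]
print(magnitude_prune(layer, sparsity=0.5))
# -> [0.91, 0.0, 0.33, 0.0, -0.78, 0.0]
```

In practice the zeroed weights let the model be stored and run more cheaply, and pruning is usually followed by a round of retraining to recover any lost accuracy; distillation, the approach behind Hugging Face’s smaller BERT, instead trains a compact model to mimic the large one.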
Smaller datasets are also more manageable—they will still contain biases and issues of their own, but it’s easier to identify and address them than it is in a dataset scraped from much of the internet. They’re also lower-cost, since every megabyte of training data—no matter the data quality, or how much new information the model is really learning from it—costs the same to process.
Right now, there’s the “underlying assumption...that if we just grab everything [from the web as-is], we will have the right view of the world; we’ve now represented all of the world,” says Mitchell. “If we’re serious about responsible AI, then we also have to do responsible dataset construction.”
That means that instead of pulling data that reflects the world as it’s represented on the web, it’s important to define constraints and criteria that actively counter racism, sexism, and a range of societal ills within data—and designate people who are accountable for every step of that process.
And it’s vital to do that sooner rather than later, according to research from OpenAI and Stanford. Open source projects are already attempting to recreate GPT-3, and the developers of large language models “may only have a six- to nine-month advantage until others can reproduce their results,” according to the paper. “It was widely agreed upon that those on the cutting edge should use their position on the frontier to responsibly set norms in the emerging field.”
MIT Professor Jacob Andreas told us students are already working toward more accountability in training data: designing their own data sets in course exercises, asking others to annotate them for potential issues, and choosing data more thoughtfully.
But for the most part, Mitchell says this kind of dataset accountability is an area of research that has yet to be explored.
“It’s only once you have the need to do something like this that people are very brilliant and creative and come up with ways to tackle it,” she says. “The goal is to make it clear that it’s a problem that needs to be solved.”
Note: Updated to reflect a revised number of team members from OpenAI. The team that reviews increased usage quota applications for GPT-3 is 6 people, not 3 as originally told to us.