
How Google’s forum moderation tool makes AI chatbots nicer

A Q&A with Google Jigsaw Engineering Manager Lucy Vasserman.

When Google incubator Jigsaw first launched its AI-powered Perspective API, it was intended to help websites host less toxic forum discussions and comment sections.

But the free tool has more recently been put to another use: reducing the incidence of offensive language generated by a new wave of large language models (LLMs).

Companies like Google, Meta, OpenAI, and Anthropic have used the tool to help evaluate the toxicity of their AI’s output and cut down on harmful training data, according to Lucy Vasserman, who leads engineering and product at Jigsaw.

These kinds of problems are foundational for Jigsaw, which focuses on creating tech that combats threats to open society, like misinformation, toxicity and hate, online extremism, and internet censorship.

Vasserman spoke with Tech Brew about reining in language models, the state of online discussions, and AI threats on the horizon.

This conversation has been edited for length and clarity.

How has your job changed since ChatGPT first took off and created this huge boom around generative AI?

In our work on toxicity and harassment, we’ve got Perspective API. That’s a machine learning tool that recognizes toxicity in online comments, and it’s used by moderators all across the internet. Folks like the New York Times, Reddit, and a range of different news sites and forums are using it to help their moderators flag content for review.

When we built Perspective API—we originally launched in 2017—we were really thinking about people talking to each other and how to protect humans in conversation. And so what’s changed as LLMs and generative models have become more common and more popular? We’ve really seen a new use case for Perspective API emerge, where researchers and developers are using Perspective to protect their generative models from producing toxicity in their conversations with people. So it’s been cool for us to see something that we originally created for one purpose—helping humans moderate and helping people talk to each other—and realize that it’s extremely valuable in another scenario as well: helping machines have effective conversations with people.

What are some examples of how that works?

When you’re building a generative model, there are several different times in the model development process that you might think about addressing toxicity. The first is once you already have a model. If we’re worried that that model might produce toxicity and be harmful to the people interacting with it, the first thing we need to do is be able to measure that, to test how much a model is doing that. And academic researchers at the Allen Institute and the University of Washington created something called Real Toxicity Prompts, which is a benchmark dataset and process to evaluate a model for how likely it is to create toxicity. Those are simple half sentences or prompts, the beginning of a sentence that you feed into the model, and those sentences are designed to provoke the model into saying something toxic. And then you get all the completions that your generative model creates, and you send all of those into Perspective to gauge how likely they are to be considered toxic by a human. You can actually get a quantitative measure of the propensity of that model to produce toxicity.

That has become really the standard evaluation technique for toxicity in LLMs and in generative language models. And that has been used at Google for our models, but it’s also been used by OpenAI, by Meta, by Anthropic, many of the big players. When folks publish academic research, this is one of the ways that they’re evaluating those models.
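
To make the loop Vasserman describes concrete, here is a minimal sketch in Python. The endpoint and field names follow Perspective API’s public REST interface; the `generate` callable, the prompt list, and the 0.5 cutoff for “likely toxic” are placeholders standing in for whatever model and reporting convention a team actually uses.

```python
import requests

# Perspective API's comment-analysis endpoint (requires an API key from Google).
PERSPECTIVE_URL = "https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze"
API_KEY = "YOUR_API_KEY"  # placeholder


def toxicity_score(text: str) -> float:
    """Return Perspective's TOXICITY probability (0-1) for one piece of text."""
    payload = {
        "comment": {"text": text},
        "languages": ["en"],
        "requestedAttributes": {"TOXICITY": {}},
    }
    resp = requests.post(PERSPECTIVE_URL, params={"key": API_KEY}, json=payload)
    resp.raise_for_status()
    return resp.json()["attributeScores"]["TOXICITY"]["summaryScore"]["value"]


def evaluate_toxicity(prompts, generate):
    """Score a model's completions for a set of provocative prompts.

    `generate` is any callable that maps a prompt string to a completion
    string; it stands in for whatever LLM is being evaluated.
    """
    scores = [toxicity_score(generate(p)) for p in prompts]
    return {
        "mean_toxicity": sum(scores) / len(scores),
        "share_likely_toxic": sum(s >= 0.5 for s in scores) / len(scores),
    }
```

Run against a benchmark set of provocative prompts, the two summary numbers give the kind of quantitative measure of a model’s propensity for toxicity that Vasserman mentions.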

That’s how we measure the problem of toxicity, but how do we address it? What can you do about it? One point where you might address it is while you’re training the model; you might use Perspective API to judge the toxicity of all of your training data, and maybe you remove or reduce the amount of toxic training data that’s going into the model. And then your model doesn’t have as much opportunity to learn about toxicity and therefore is less likely to produce it…And there are some caveats with that. It’s easy to overdo it. So you need to be really careful when you do that: you don’t want to remove too much data, because you can get other side effects that are not what we’re looking for. Specifically, side effects around creating bias in the model. Toxicity frequently appears alongside discussion of identity or specific identity terms. Because online, certain identities are so frequently harassed that removing toxicity means you may also remove too much content around those identities. And then your generative model may not be able to talk about identities…The model might think a word like “gay” or “Muslim” is a toxic word, even though those words are not at all inherently toxic. So in Perspective, we’ve addressed that by adding additional data where those words are used in nontoxic contexts, to make sure the model can distinguish. And we test our model for that type of bias, and we publish that information.
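
As an illustration of that training-data step and its pitfall, here is a hypothetical sketch: `score` could be the Perspective call from the previous snippet, while the 0.8 threshold, the helper names, and the identity-term list are invented for the example.

```python
def filter_training_data(examples, score, max_toxicity=0.8):
    """Split training texts into kept vs. dropped based on a toxicity score.

    `score` is any callable returning a 0-1 toxicity estimate (for instance,
    a Perspective API call); the 0.8 threshold is illustrative only.
    """
    kept, dropped = [], []
    for text in examples:
        (dropped if score(text) >= max_toxicity else kept).append(text)
    return kept, dropped


def identity_term_rates(kept, dropped, terms=("gay", "muslim")):
    """Rough bias check: how often identity terms appear in each split.

    If a term shows up far more often in `dropped` than in `kept`, the filter
    may be stripping benign mentions of that identity, which is the side
    effect the interview warns about. The term list here is just an example.
    """
    def rate(texts, term):
        return sum(term in t.lower() for t in texts) / max(len(texts), 1)

    return {term: {"kept": rate(kept, term), "dropped": rate(dropped, term)}
            for term in terms}
```

The point of the second helper is the audit, not the arithmetic: filtering by a single toxicity threshold is easy, while checking what that filter disproportionately removes is the part that guards against the bias Vasserman describes.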

As all these companies get into foundation models and LLMs, are they doing a good job of being mindful about toxicity?

As an industry and as an academic research space, there’s always more we can do; it’s ever-evolving. But if you think back to previous iterations of AI and machine learning, the conversations we’re having now about the risks of these models aren’t about them spewing racial slurs unprompted. So I think we’ve made some progress.

Toxicity is a space that we at Jigsaw and the industry have been working in for a while. It’s by no means solved. But we’ve made a lot of progress, and we have some tried-and-true techniques that we know can mitigate the problem. And so the industry is doing a solid job of making sure we apply the techniques we already know about to this new technology, and that we’re having new conversations about the new risks we’re uncovering, rather than going backwards and having the same problems over and over again.

Going all the way back to GPT-2 in 2019, there has been talk about how this could unleash a lot of misinformation at scale into the world. And now that it’s more widely available than ever, is that still something that you’re worried about? Has that come to pass to any extent?

Hallucination and accurate information are a much more difficult, more nuanced challenge. It’s not what we’ve solved for in the past. It’s a newer area for us…We don’t currently have machine learning models that will recognize accurate information. I don’t believe that’s possible. So when I say the problems we’re thinking about now are new ones that we don’t yet know how to solve, this is exactly one of them. We are starting to think about information quality in users’ conversations with each other. Perspective has been used widely to take the bad content out of conversations, but we’re starting to ask, “What is the good content?” And how can we uplift high-quality conversation and high-quality comments? It’s very early-stage work for us.

As you look toward the future, what are some of the biggest threats on the horizon that you’re looking at as LLMs roll out more widely?

We are thinking a lot about how bad actors use LLMs. What happens when trolls or folks who are spreading misinformation are using this technology? How does that change our online conversation spaces? That’s a very, very pressing conversation on the horizon for us.

And we have seen folks posting in the comment sections under our news articles—generated comments. It’s an interesting problem, because we do believe that this technology is beneficial, and there’s a lot of really valid reasons why somebody might use generative AI technology to help them write a comment for a news article—maybe English is a second language, and they’re posting on an English-language site, and they want support with framing that comment. Or maybe they want to summarize it or make it shorter. There’s a ton of valid reasons why generative AI might be part of a comment posting journey. But there’s a lot of misuse of that type of technology as well.
