
Synthetic data can help create less biased datasets—but it’s no silver bullet

A majority of AI training data could be synthetic in two years, Gartner predicts.

Synthetic data may not be generated by humans, but that doesn’t mean it’s automatically free of human bias.

Quick recap: As we wrote in our primer, synthetic data is data that reads, looks, or acts like it’s been collected from real people, when in actuality, it’s been created by artificial intelligence. Deep-learning algorithms train on real-world data and then do what models do best—flag patterns and trends. Then, they use those patterns to create an entirely new set of data without ties to any real individuals.
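To make that pipeline concrete, here's a deliberately minimal sketch of the idea using plain NumPy: estimate the statistical structure of a toy "real" dataset, then sample brand-new records from the learned distribution. (This is an illustration of the concept only; the data, numbers, and Gaussian model here are invented for the example, and real synthetic-data vendors use far more sophisticated deep-learning generators.)

```python
import numpy as np

# Toy "real" dataset: (age, income) pairs for 1,000 people.
rng = np.random.default_rng(0)
real = rng.multivariate_normal(
    mean=[40, 55_000],
    cov=[[100, 30_000], [30_000, 4e8]],
    size=1_000,
)

# "Train" on the real data: learn its statistical structure
# (here, just the mean vector and covariance matrix).
mean = real.mean(axis=0)
cov = np.cov(real, rowvar=False)

# Generate synthetic records from the learned distribution;
# no synthetic row corresponds to any real individual.
synthetic = rng.multivariate_normal(mean, cov, size=1_000)

print(synthetic.shape)  # (1000, 2)
```

The synthetic rows preserve the aggregate patterns (average age, the age–income correlation) without copying any real record, which is the privacy argument for the approach; it is also why, as discussed below, the biases baked into those aggregate patterns come along for the ride.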

The field, seen as a privacy-preserving solution, is expanding fast: In two years, 60% of AI training data could be synthetic, Gartner predicts.

But while the field could help alleviate many privacy issues inherent to building training datasets, its synthetic nature won’t automatically solve another challenge: bias in datasets. After all, if synthetic data carries over all the same insights and trends as the original dataset, what’s to prevent it from inheriting all of the original biases, too?

Right now, the answer seems to be “very little.” But some synthetic-data organizations and researchers are working to address it via an early-stage subfield: fair synthetic data.

Some synthetic-data startups, like Mostly AI, Synthesized, and Hazy, offer clients the ability to fill out datasets with potentially more representative data, or to weight certain aspects of the data more heavily than others.

“With fair synthetic data, what you can do is correct biases, where there…are some examples of what you would like to see,” Alexandra Ebert, chief trust officer at Mostly AI, a Vienna-based synthetic-data company, told us. “So, talking about the gender pay gap or income distribution, if I have some higher-earning females in the dataset, then the algorithm can learn from them and can learn what’s logical for them.”

For example, Ebert explained that the algorithm could learn how frequently high-earning women go on vacation, what their overall spending looks like, and other behavioral patterns. Then, it can generate more examples of high-earning women based on this user group, Ebert said, “without copying one given female 20,000 times, but just creating and dreaming up new high-earning females with logical spending behaviors, professions, and so on.”
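The rebalancing idea Ebert describes can be sketched in a few lines: learn each group's statistics from whatever examples exist, then generate equal numbers of fresh records per group instead of mirroring the original skew. (A toy illustration under invented numbers, not Mostly AI's actual method; the salary distributions and group labels here are made up for the example.)

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy "real" dataset of salaries with a skewed mix among
# high earners: 900 men, only 100 women.
men = rng.normal(90_000, 15_000, size=900)
women = rng.normal(95_000, 15_000, size=100)

# Learn per-group statistics from the examples that do exist,
# however few.
stats = {g: (x.mean(), x.std()) for g, x in [("M", men), ("F", women)]}

# Generate a balanced synthetic dataset: equal counts per group,
# each drawn from that group's learned distribution rather than
# one real record copied thousands of times.
n_per_group = 500
synthetic = {
    g: rng.normal(mu, sigma, size=n_per_group)
    for g, (mu, sigma) in stats.items()
}

print(len(synthetic["M"]), len(synthetic["F"]))  # 500 500
```

Note the limitation this makes visible: the generator can only extrapolate from the 100 real examples it saw. If those examples are unrepresentative to begin with, as in the cardiovascular-research case below, balancing the counts doesn't fix the underlying gap.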

Keep up with the innovative tech transforming business

Tech Brew keeps business leaders up-to-date on the latest innovations, automation advances, policy shifts, and more, so they can make informed decisions about tech.

But there are limitations to fair synthetic data, Ebert added: It’s easier to build new examples using cut-and-dried information, like financial transactions or salary information, than with more nuanced information, like health records. Some medical conditions can present differently in men and women—so if, for example, there's a lack of representative research into cardiovascular disease in women to begin with, there's no way for synthetic data to rectify that fundamental issue.

As the synthetic data field expands, research is developing in both the US and UK.

Last year, researchers at Amazon wrote that they designed a method to generate “fair and unbiased” synthetic data and used it to train AI models. They concluded that the models performed well but left room for improvement; for instance, scaling the algorithm to more complex datasets would take additional work.

And joint research from the University of Cambridge and the University of California, Los Angeles in 2021 explored using “unfair data” to generate fair synthetic data. The researchers concluded that their method “successfully removes undesired bias,” providing “theoretical guarantees on…the fairness of downstream models.”

In 2021, Michael Platzer, the co-founder of Mostly AI, wrote in a blog post that machine learning models trained on fair synthetic data “will be fair by design.” But even that’s not necessarily true, for the same reasons that having an AI ethics team look over an AI tool before rollout doesn’t automatically nix any potential harms it could cause.

“It’s not a silver bullet in any way,” Vivek Muppalla, director of synthetic services at Scale AI, told us. He added, “These are still people who are creating some of these synthetic datasets, so we have to be very careful and measured in how we are approaching the space and always keeping the ethical considerations top of mind.”
