Big companies are banding together to better label AI data

Their new standards aim to curtail data-related woes for companies diving into AI.

December 4, 2023


Data is often cited as the No. 1 obstacle for companies looking to get into AI. Data scientists can spend nearly half their time wrangling it into shape, and three in five CEOs say a lack of clarity around data provenance is a barrier to implementing generative AI tools.

Many major corporations are hoping to ease some of those headaches with a new set of standards designed to clarify where and how the data they use was collected. Nike, IBM, UPS, Mastercard, and Walmart are among the more than 25 companies that contributed to developing the guidelines as part of a nonprofit consortium called the Data & Trust Alliance.

Companies have been grappling with large volumes of data for years, but the problem has arguably become more urgent in the past year as many businesses have tried to build internal tools around large language models (LLMs). Fine-tuning those models to perform specialized tasks can take a huge amount of internal information, raising new concerns around data privacy and responsibility.

The alliance claims it will be the first to create standards around data provenance that apply to multiple industries.

“The creators of AI platforms are not the only players in this inflection point,” Ken Finnerty, president of IT and data analytics at UPS, said in a statement accompanying the announcement. “Enterprises in every industry are deploying data and intelligent systems that are core to their business…Data provenance is critical to those efforts.”

There are eight proposed standards in all, which cover areas like metadata, legal rights, privacy, generation date, data type, generation method, intended use, and restrictions and lineage. Some of these standards already exist but are not consistently included in metadata accompanying datasets, the alliance said in the announcement. Others, like intended use and “provenance metadata unique ID,” will be entirely new, according to the group.
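To make the idea concrete, here is a minimal sketch of what a provenance record attached to a dataset might look like. It is not the alliance's actual schema: the class name, field names, and the `missing_fields` helper are all hypothetical, chosen only to mirror the areas listed above.

```python
from dataclasses import dataclass, field, fields
from datetime import date
from typing import List, Optional
import uuid


@dataclass
class ProvenanceRecord:
    """Hypothetical record; field names are assumptions, not the published standard."""
    legal_rights: str = ""
    privacy: str = ""
    generation_date: Optional[date] = None
    data_type: str = ""
    generation_method: str = ""
    intended_use: str = ""
    restrictions: str = ""
    lineage: List[str] = field(default_factory=list)
    # Stand-in for the "provenance metadata unique ID" mentioned in the announcement.
    provenance_id: str = field(default_factory=lambda: str(uuid.uuid4()))


def missing_fields(record: ProvenanceRecord) -> List[str]:
    """Return the names of fields that were left empty."""
    return [f.name for f in fields(record) if not getattr(record, f.name)]


# Example: a vendor supplies only part of the metadata.
record = ProvenanceRecord(legal_rights="licensed", data_type="tabular")
missing = missing_fields(record)
print(missing)
```

A buyer could run a check like this before accepting a dataset, flagging which provenance details a vendor has not supplied.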

The alliance is also hoping that data vendors and other partners will adopt the standards, making it easier for companies to know what kinds of datasets they are buying or using. Companies in the group are currently testing the standards in areas like supply chain, compliance, and healthcare, the announcement said.

The Data & Trust Alliance isn’t the only effort by big companies to better track data provenance in the generative AI era. A separate group of companies led by Adobe, called the Content Authenticity Initiative, has convened to create and maintain standards around labeling alterations to images and other media.
