By Tech Brew Staff
Definition:
Training data encompasses any data used to teach a machine learning model to recognize patterns, make predictions, and perform tasks. It might take the form of measurement datasets, language converted into tokenized sequences, images and videos broken down into pixel values, or any other numerical representation of the world, and it can be labeled or unlabeled. That training data is fed into a machine learning model (basically, a giant equation) whose parameters are tweaked over time by a governing algorithm as it works through the data. The more training data you have, and the more representative the sample is of the source material, the more accurate the model will be.
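To make that concrete, here is a minimal sketch of the idea in Python. The dataset, the one-parameter "model," and the learning rate are all invented for illustration; real systems tune billions of parameters the same basic way.

```python
# Labeled training data: inputs paired with the answers we want
# (toy measurements invented for this example).
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]  # (x, y) pairs

w = 0.0    # the "giant equation," shrunk to one knob: predict y as w * x
lr = 0.01  # learning rate: how big each tweak is

# The governing algorithm (gradient descent) works through the data and
# tweaks the model over time to shrink its average squared error.
for step in range(1000):
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    w -= lr * grad

print(f"learned w = {w:.2f}")  # lands near 2, the trend hiding in the data
```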
Sum of its parts
Modern large language models are trained on trillions of words from internet data, books, and other sources. Visual models are likewise trained on huge troves of often-copyrighted images and video. Companies like OpenAI generally don't disclose what goes into their training datasets. While AI companies argue that their use of copyrighted materials constitutes fair use, a copyright doctrine that allows for non-permissioned use of copyrighted works in certain cases, the courts have yet to definitively decide the question, and a number of court cases between tech companies and rightsholders are set to test that argument.
Refining process
Modern AI companies also use a technique called reinforcement learning from human feedback (RLHF), a protocol in which human testers rank the quality of a model's answers. The system is then adjusted accordingly.
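One standard way to turn those rankings into a training signal is a pairwise preference loss, sketched below in plain Python. The reward scores are hypothetical, and this is a common formulation, not necessarily any particular company's pipeline.

```python
import math

# Hypothetical reward-model scores for two answers to the same prompt;
# a human tester ranked the first answer above the second.
r_preferred = 1.8
r_rejected = 0.6

# Pairwise preference loss: near zero when the preferred answer already
# out-scores the rejected one, large when the model has it backwards.
# Minimizing this over many ranked pairs is how the system gets
# "adjusted accordingly."
loss = -math.log(1 / (1 + math.exp(-(r_preferred - r_rejected))))
print(f"pairwise loss: {loss:.3f}")
```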
Modern foundation models can also be fine-tuned on smaller datasets. Through this process, an LLM already trained on a massive amount of data to give it general knowledge and an understanding of language—like OpenAI’s GPT-4o—can be honed into a system that is well-versed in, say, legal documents or job postings.
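In code, fine-tuning usually amounts to resuming training on the new, smaller dataset at a gentle learning rate. The sketch below uses PyTorch with a toy model and stand-in data; the shapes, labels, and hyperparameters are placeholders, not a recipe for any specific model.

```python
import torch

model = torch.nn.Linear(16, 4)         # stand-in for a pretrained model
# model.load_state_dict(...)           # in practice: load pretrained weights

domain_x = torch.randn(32, 16)         # stand-in for tokenized legal documents
domain_y = torch.randint(0, 4, (32,))  # stand-in labels for the new specialty

# Small learning rate and few passes: nudge the general-purpose model
# toward the domain without erasing what it already knows.
opt = torch.optim.AdamW(model.parameters(), lr=1e-5)
loss_fn = torch.nn.CrossEntropyLoss()

for epoch in range(3):
    opt.zero_grad()
    loss = loss_fn(model(domain_x), domain_y)
    loss.backward()
    opt.step()
```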
Scaling up the amount of training data fed to a foundation model has yielded more and more advanced models, as predicted by the scaling laws OpenAI laid out in 2020, but there are fears this approach may soon run into diminishing returns. There's also only so much training data publicly available in the world, and AI companies have already gobbled up most of it.
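For reference, that 2020 paper (Kaplan et al., "Scaling Laws for Neural Language Models") fit test loss to a power law in dataset size; the constants below are the paper's approximate reported values.

```latex
% Loss as a power law in dataset size D (in tokens), per Kaplan et al. (2020)
L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D},
\qquad \alpha_D \approx 0.095, \quad D_c \approx 5.4 \times 10^{13}
```

Because the exponent is small, each doubling of the dataset trims loss by only a few percent, which is the diminishing-returns worry in a nutshell.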