Reinforcement learning

By Tech Brew Staff

Definition:

Reinforcement learning is a method in which a model is trained via trial and error to maximize a reward function. The model receives feedback in the form of rewards or penalties, depending on whether its behavior furthers the desired goal. Reinforcement learning differs from supervised learning in that it doesn’t rely on labeled data, and from unsupervised learning in that there’s a desired outcome. A reward increases the likelihood the machine will use the same tactics again; a penalty decreases the chance it will repeat the behavior.
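To make the reward-and-penalty loop concrete, here is a minimal sketch in Python. It is a toy illustration, not how any system mentioned in this article was trained. It assumes a made-up setup with three possible actions, each of which pays a reward (+1) or a penalty (-1) with a fixed probability; the function names, the `reward_probs` values, and the learning rate are all hypothetical. The update rule is a simple "gradient bandit" scheme: a rewarded action becomes more likely to be picked again, a penalized one less likely.

```python
import math
import random


def softmax(prefs):
    """Turn raw action preferences into probabilities that sum to 1."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def train(reward_probs, steps=5000, lr=0.1, seed=0):
    """Learn by trial and error which action tends to be rewarded.

    reward_probs[i] is the (hypothetical) chance that action i earns a
    reward of +1; otherwise it earns a penalty of -1.
    """
    rng = random.Random(seed)
    prefs = [0.0] * len(reward_probs)  # start with no preference
    for _ in range(steps):
        probs = softmax(prefs)
        # Try an action, sampled according to current preferences.
        action = rng.choices(range(len(prefs)), weights=probs)[0]
        # The environment answers with a reward or a penalty.
        reward = 1.0 if rng.random() < reward_probs[action] else -1.0
        # A reward raises the chosen action's preference (and lowers the
        # others); a penalty does the opposite.
        for a in range(len(prefs)):
            chosen = 1.0 if a == action else 0.0
            prefs[a] += lr * reward * (chosen - probs[a])
    return softmax(prefs)


# Action 2 is rewarded most often, so the learner should come to favor it.
final_probs = train([0.1, 0.5, 0.9])
print([round(p, 3) for p in final_probs])
```

No labeled examples are supplied here: the learner is never told which action is "correct" and only sees the reward or penalty that follows each attempt.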

Google DeepMind used reinforcement learning to train AlphaGo, a deep learning system that was able to beat the legendary Go player Lee Sedol in 2016 as 200 million online viewers watched, a breakthrough moment for AI research.

Today, reinforcement learning is used in certain aspects of training robots and self-driving cars in simulated environments. Companies like Netflix and Amazon have explored reinforcement learning as a way to improve recommendation engines for content and products. The method has also been explored for industrial applications, like cooling data centers or automating Amazon warehouses.

Most modern AI companies also use reinforcement learning from human feedback (RLHF) as a post-training method to improve foundation models. In this technique, human testers rate various model outputs and the system is adjusted accordingly.

DeepSeek employed a somewhat novel form of pure reinforcement learning as one of the efficiency measures that allowed it to train its R1 reasoning model at a purported fraction of the cost of leading competitors. The Allen Institute for AI used a similar reinforcement learning method to train its own open-source model that can hold its own against DeepSeek.
