Covering Scientific & Technical AI | Saturday, January 18, 2025

Is Reinforcement Learning Right for Your AI Problem? 

In the world of machine learning, reinforcement learning is an important sub-field that is frequently combined with deep learning. Deep learning loosely mimics the human brain through a hierarchical structure of artificial neural networks.

Reinforcement learning (RL) is a basic machine learning paradigm that does not require the raw data to be labeled, as supervised learning typically does. Instead of being told the correct answer, an algorithm receives a reward indicating whether it made a good decision. RL is based on interactions between an AI system and its environment. An algorithm receives a numerical score based on its outcome, and the positive behaviors are then “reinforced” to refine the algorithm over time. In recent years, RL has been behind super-human performance on Go, Atari games and many other applications.
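The interaction loop described above can be sketched with tabular Q-learning, one common RL algorithm chosen here only for brevity. The five-state corridor environment, the `step` function, and all hyperparameters are assumptions made for illustration; they do not come from the article.

```python
import random

N_STATES = 5          # states 0..4; reaching state 4 ends the episode
ACTIONS = (-1, +1)    # move left or move right

def step(state, action):
    """Hypothetical environment: reward 1.0 only on reaching the last state."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

def greedy(q_row, rng):
    """Pick the highest-valued action index, breaking ties at random."""
    best = max(q_row)
    return rng.choice([i for i, v in enumerate(q_row) if v == best])

def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]   # q[state][action index]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Explore occasionally, otherwise exploit the current estimates.
            a = rng.randrange(2) if rng.random() < epsilon else greedy(q[state], rng)
            next_state, reward, done = step(state, ACTIONS[a])
            # Numerical score: reward now plus discounted best future value.
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            # "Reinforce": nudge the estimate toward the observed target.
            q[state][a] += alpha * (target - q[state][a])
            state = next_state
    return q

q_table = train()
greedy_policy = [ACTIONS[max(range(2), key=lambda i: q_table[s][i])]
                 for s in range(N_STATES - 1)]
print(greedy_policy)
```

No state is ever labeled with a correct action. The agent only sees a score at the end of the corridor, and the update rule propagates that score back to the earlier moves that led to it.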

Imagine training a machine learning agent to trade stocks. One option is to provide the system with many examples of good strategies – i.e., labeled information about whether to sell a particular stock at a particular time or not. This is the well-known supervised learning paradigm. Because the agent is trying to mimic good strategies, it cannot outperform them. How can we find strategies that outperform the expert? The answer is RL.

But while RL is a powerful approach to AI, it is not a fit for every problem, and there are multiple types of RL.

Ask yourself these six questions to decide which type of RL, if any, might help with the problem you are trying to solve:

  1. Does My Algorithm Need to Make a Sequence of Decisions?

RL is a perfect fit for problems that require sequential decision-making – that is, a series of decisions that all affect one another. If you are developing an AI program to win at a game, it is not enough for the algorithm to make one good decision; it must make a whole sequence of good decisions. By providing a single reward for a positive outcome, RL weeds out solutions that result in low rewards and elevates those that enable an entire sequence of good decisions.
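One standard way a single end-of-sequence reward gets credited to every earlier decision is the discounted return. The sketch below assumes a four-step episode and a discount factor `gamma`; both are illustrative choices rather than values from the article.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for each step t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the return of the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Only the final move is rewarded (for example, winning the game).
episode_rewards = [0.0, 0.0, 0.0, 1.0]
print(discounted_returns(episode_rewards, gamma=0.5))
```

With `gamma=0.5` this prints `[0.125, 0.25, 0.5, 1.0]`: every decision in the winning sequence receives some credit, with earlier decisions credited less.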

  2. Do I Have an Existing Model?

If you want to write a program for a robot to pick up a physical object, then you can use the laws of physics to inform your model. But if you are trying to write a program to maximize returns in the stock market, there is no existing model that can be used. Instead, you will need to use heuristics that have been manually tuned over time. But these heuristics might be suboptimal. Typically, RL is a good fit when there is no existing model to rely on or you want to improve over an existing decision-making strategy.

  3. How Much Data Do I Have? What is at Stake if a Wrong Decision is Made?

The amount of data you already have and the cost of making bad decisions may help you to determine whether to use online or offline RL.

For instance, imagine you are running a video platform, and you need to train an algorithm to offer recommendations to users. If you have no data, then you have no option but to interact with the user and make recommendation decisions in real-time, using an online process. Such exploration comes at a cost – a few bad recommendations made while the system is learning can disappoint the user. However, if you already have large amounts of data, you can develop a good policy without interacting with specific users. This is offline RL training.
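As a rough sketch of the offline setting, the code below learns action values purely from a fixed log of past interactions, with no new recommendations shown to anyone. The states, actions, rewards, and the `logged` dataset are all hypothetical. The plain Q-learning update is used for brevity; practical offline RL methods usually add safeguards against actions that are poorly covered by the log.

```python
GAMMA = 0.9

# Hypothetical logged transitions: (state, action, reward, next_state, done).
logged = [
    ("home", "show_news", 0.0, "home", False),
    ("home", "show_music", 0.0, "browsing", False),
    ("browsing", "show_news", 0.0, "home", False),
    ("browsing", "show_music", 1.0, "watched", True),
]

def offline_q_learning(dataset, sweeps=50, alpha=0.5):
    """Repeatedly replay the fixed dataset; the environment is never queried."""
    q = {}
    for _ in range(sweeps):
        for s, a, r, s2, done in dataset:
            # Best known value among actions logged in the next state.
            next_best = 0.0 if done else max(
                (v for (st, _), v in q.items() if st == s2), default=0.0)
            old = q.get((s, a), 0.0)
            q[(s, a)] = old + alpha * (r + GAMMA * next_best - old)
    return q

def greedy_policy(q):
    """For each logged state, keep the action with the highest learned value."""
    policy = {}
    for (state, action), value in q.items():
        if state not in policy or value > q[(state, policy[state])]:
            policy[state] = action
    return policy

policy = greedy_policy(offline_q_learning(logged))
print(policy)
```

An online learner would instead generate each transition itself by showing a recommendation to a live user, which is where the cost of bad exploratory decisions comes from.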

  4. Does My Goal Change?

Sometimes in AI, your target never changes. With stocks, you are always going to want to maximize your returns. Such a problem is not goal-conditioned, because you are always solving for the same goal. But in other cases, your goal might be a moving target. Consider Loon, Google’s recently shuttered effort to build giant balloons to beam the internet to rural areas. Here, the optimal position for each balloon is different. For such instances, goal-conditioned RL is a better fit.
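The core idea of goal conditioning is that the value function and policy take the goal as an extra input, so one learned table serves every target. The sketch below is a toy assumption: positions on a five-cell line stand in for balloon positions, and to keep the result deterministic it sweeps over every (state, goal, action) triple with a known toy model instead of sampling experience as a real RL agent would.

```python
N = 5                 # positions 0..4 on a line
ACTIONS = (-1, +1)    # drift left or drift right
GAMMA = 0.9

def step(state, action, goal):
    """Hypothetical dynamics: reward 1.0 when the move lands on the goal."""
    nxt = min(max(state + action, 0), N - 1)
    reached = nxt == goal
    return nxt, (1.0 if reached else 0.0), reached

def train_goal_conditioned_q(sweeps=10):
    """Learn Q(state, goal, action); the goal is part of the lookup key."""
    q = {(s, g, a): 0.0 for s in range(N) for g in range(N) for a in ACTIONS}
    for _ in range(sweeps):
        for (s, g, a) in list(q):
            nxt, reward, done = step(s, a, g)
            future = 0.0 if done else GAMMA * max(q[(nxt, g, b)] for b in ACTIONS)
            q[(s, g, a)] = reward + future
    return q

def act(q, state, goal):
    """Same table, different behavior depending on the requested goal."""
    return max(ACTIONS, key=lambda a: q[(state, goal, a)])

q = train_goal_conditioned_q()
print(act(q, 2, 4), act(q, 2, 0))  # from the same state: right for goal 4, left for goal 0
```

A non-goal-conditioned agent would need a separate table, and separate training, for each target position.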

  5. How Long is My Time Horizon?

So, how many decisions must my algorithm make before arriving at a solution?

The answer may help you to determine whether to use hierarchical or non-hierarchical RL. Consider writing a program to make a robot pick up an object. The robot needs to move close to the object and close its grippers to lift it. For programs like this, with small numbers of decisions, non-hierarchical RL is often adequate. Now imagine the same robot needs to locate nails, place them on a board, then pick up a hammer and hit the nails with the hammer. At an abstract level there are only three or four steps. But if one is writing a program that outputs the position of the robot’s hands, it will be a long sequence of actions. In such cases with longer time horizons, hierarchical RL is often useful.
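The two time scales in the hammer-and-nail example can be illustrated with a small decomposition: a high level that picks a handful of subtasks and a low level that expands each subtask into primitive hand movements. Everything here is a hand-coded assumption (the subtask names, the one-dimensional hand position, the target coordinates). In actual hierarchical RL, one or both levels would be learned rather than scripted.

```python
def low_level_controller(position, target):
    """Expand one subgoal into unit moves (-1 or +1) of the hand."""
    direction = 1 if target > position else -1
    return [direction] * abs(target - position)

def high_level_plan():
    """Hypothetical fixed plan: (subtask name, target hand position)."""
    return [
        ("reach_nail", 3),
        ("place_nail_on_board", 7),
        ("reach_hammer", 2),
        ("strike_nail", 7),
    ]

def run_hierarchy(start=0):
    """Execute each subtask in order and collect the primitive actions."""
    position, primitive_actions = start, []
    for _name, target in high_level_plan():
        primitive_actions.extend(low_level_controller(position, target))
        position = target
    return primitive_actions

actions = run_hierarchy()
print(len(high_level_plan()), "high-level decisions,", len(actions), "primitive actions")
```

Even in this toy setting, four abstract decisions unfold into 17 primitive actions. A flat agent would have to discover that whole sequence from a single final reward, which is the difficulty hierarchical methods aim to reduce.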

  6. Is Your Task Really Sequential Decision Making? What Information Do I Have About My Users?

Say you are looking to optimize the design of a website for selling a particular product. In some cases, a user may never return to your website. Whether the user makes a purchase may depend on the color of the website. You might show users three different color backgrounds at random and see which performs best. But if you have additional information about your users – say, their gender or geographic location – you can incorporate this information and use it to better train your AI program. Contextual bandits are an approach for making a single decision that is tailored to these types of situations. With contextual bandits there are theoretical guarantees on performance: an algorithm can test out different actions and learn which has the most rewarding outcome for a given situation. However, if the user is going to be returning multiple times, go ahead and use RL in its most general form, with the caveat that those theoretical guarantees no longer apply.
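A minimal sketch of the website example as a contextual bandit follows. It uses epsilon-greedy exploration with a separate running average per (context, color) pair, chosen only because it is short; it is not necessarily the algorithm the authors have in mind, and it does not carry the strongest guarantees available in the bandit literature. The regions, colors, and purchase rates are invented for the simulation.

```python
import random

COLORS = ("red", "green", "blue")

# Hypothetical ground truth, unknown to the learner: purchase rate per region and color.
TRUE_PURCHASE_RATE = {
    "US": {"red": 0.8, "green": 0.2, "blue": 0.1},
    "EU": {"red": 0.1, "green": 0.2, "blue": 0.8},
}

def run_bandit(rounds=4000, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    counts = {(c, a): 0 for c in TRUE_PURCHASE_RATE for a in COLORS}
    values = {key: 0.0 for key in counts}   # estimated purchase rate
    for _ in range(rounds):
        context = rng.choice(sorted(TRUE_PURCHASE_RATE))   # a visitor arrives
        if rng.random() < epsilon:
            color = rng.choice(COLORS)                     # explore
        else:
            color = max(COLORS, key=lambda a: values[(context, a)])  # exploit
        # One decision, one immediate outcome: did this visitor buy?
        reward = 1.0 if rng.random() < TRUE_PURCHASE_RATE[context][color] else 0.0
        counts[(context, color)] += 1
        # Incremental sample average of observed rewards.
        values[(context, color)] += (reward - values[(context, color)]) / counts[(context, color)]
    return values

values = run_bandit()
best = {c: max(COLORS, key=lambda a: values[(c, a)]) for c in TRUE_PURCHASE_RATE}
print(best)
```

Each visitor is treated as a one-shot decision: the update uses only the immediate reward and never looks at a next state, which is what separates this setting from full sequential RL.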

This list of questions is by no means comprehensive. For instance, there are considerations about safety and fairness to factor in as well. But by asking these six questions, data scientists can begin to get a sense of how RL might best help them to solve their problems.

About the Authors

Pulkit Agrawal, MIT

Pulkit Agrawal is assistant professor of electrical engineering and computer science at MIT and leads the Improbable AI Lab, which is part of the Computer Science and Artificial Intelligence Lab at MIT.

Cathy Wu, MIT

Cathy Wu is the Gilbert W. Winslow Career Development Assistant Professor of civil and environmental engineering at MIT and has worked across many fields and organizations, including Microsoft Research, OpenAI, the Google X Self-Driving Car Team, AT&T, Caltrans, Facebook and Dropbox. Wu is also the founder and Chair of the Interdisciplinary Research Initiative at the ACM Future of Computing Academy.

Agrawal and Wu are also co-instructors of the MIT Professional Education course, Advanced Reinforcement Learning, which is part of the Professional Certificate in Machine Learning & Artificial Intelligence.

 

AIwire