Module2 - Material - Reinforcement learning

This page gives a very brief introduction to reinforcement learning (RL). We will treat reinforcement learning in more detail in the second AS course that runs next semester. RL is currently a vibrant area of research and a topic where many different fields meet, so we think it is important that you at least get to know the very basics. The first myth to kill is that reinforcement learning is something new. Early papers such as "Steps Towards Artificial Intelligence" (Marvin Minsky, 1961, Proceedings of the IRE, 49: 8–30) already discuss many of the challenges that are discussed today.

You may have heard of supervised and unsupervised learning. Reinforcement learning is often considered the third machine learning paradigm, complementing these two: here the learning takes place via interaction with the world rather than via examples presented to the agent.

The book "Reinforcement Learning: An Introduction" by Sutton and Barto from 1998 is often used as the text book for courses in RL. It is freely available online and a second edition is in the making (http://www.incompleteideas.net/book/the-book-2nd.html Links to an external site.). In the words of Sutton and Barto "Reinforcement learning is learning what to do—how to map situations to actions—so as to maximize a numerical reward signal. The learner is not told which actions to take, but instead must discover which actions yield the most reward by trying them. In the most interesting and challenging cases, actions may affect not only the immediate reward but also the next situation and, through that, all subsequent rewards. These two characteristics—trial-and-error search and delayed reward—are the two most important distinguishing features of reinforcement learning." 

To find good solutions the agent must try new things, i.e. it must explore. This is especially true in the beginning, when it does not know anything. With time it should also exploit its current knowledge about which actions are good. There is a constant trade-off between exploration and exploitation. Since exploration can be impossible or dangerous, much research goes into letting the system do most of the exploration in simulation. This is also important when dealing with hardware, as many RL methods require thousands or even millions of iterations, which would wear out the hardware in no time.
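
A common and very simple way to handle this trade-off is the so-called epsilon-greedy strategy: with a small probability the agent picks a random action (exploration), and otherwise it picks the action it currently believes is best (exploitation). The snippet below is only a minimal sketch; the action values in the example are made up.

import random

def epsilon_greedy(q_values, epsilon=0.1):
    # q_values: current estimate of how good each action is in this state
    if random.random() < epsilon:
        return random.randrange(len(q_values))  # explore: try a random action
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit

# Example with two actions (say left and right) and made-up estimates
print(epsilon_greedy([0.2, 0.5], epsilon=0.1))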

One of the more spectacular recent achievements of RL is AlphaGo, developed by DeepMind, but RL is also used frequently in robotics, for example to learn motor skills.

Reinforcement learning basics

You can start by taking a look at the following two introductory videos. They are connected to the free online course on RL from Udacity.

Introduction to Reinforcement Learning

Reinforcement Learning Basics

DeepMind also has video lectures on the subject.

The two main elements of reinforcement learning are

  • The Agent, which is the one making the decisions about what to do, and
  • the Environment, in which the agent operates.

The agent is trying to achieve some kind of goal in an environment about which it is uncertain in various ways. Reinforcement learning requires interaction between the agent and the environment. The actions performed by the agent influence the environment and thereby the future of the agent. An action might affect the environment in ways that are not immediately observable. The environment is characterized by the state it is in. Getting this state information typically requires sensing and perception, and often the full state of the environment cannot be sensed. In addition, the exact effects of the agent's actions on the environment are not known precisely.
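
The interaction can be summarized as a simple loop: the agent observes the state, chooses an action, and the environment responds with a new state and a reward. The toy environment below is invented purely for illustration (a short corridor where the agent is rewarded for reaching the right end); the structure of the loop is the important part.

import random

class CorridorEnv:
    # Toy environment: states 0..4, start at 0, reward when state 4 is reached
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action 0 moves left, action 1 moves right
        self.state = max(0, min(4, self.state + (1 if action == 1 else -1)))
        done = self.state == 4
        reward = 1.0 if done else 0.0
        return self.state, reward, done

env = CorridorEnv()
state = env.reset()
done = False
while not done:
    action = random.choice([0, 1])          # a (poor) policy: act at random
    state, reward, done = env.step(action)  # the environment reacts
    print(state, reward, done)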

MDPs, as described in Module2 - MDPs and POMDPs, are the mathematical framework used to model RL problems. In contrast to a standard MDP problem we do not, typically, assume that we know how the environment works (for example the transition function). The agent performs actions and learns how the world works and which actions to perform.

In addition to agent, environment and state, the following elements are central

  • The behavior of the agent is determined by the policy, which maps a state to an action; that is, given a certain state the policy tells the agent what to do. It could be represented as a lookup table which simply lists, for each state, what action to take, but in general it is a more complex function which could involve a lot of computation to evaluate.
  • The reward function is the basis for defining the objective of the agent, which is to maximize the total reward that the agent receives. It assigns a numerical value to pairs of states and actions. That is, the reward is a function that tells how good (in an immediate sense) it is to perform an action a in state s.
  • The value function captures the expected total reward from a state. It is important to note that in RL it is the reward in the long run that matters, not just the reward in a certain state. This long-run perspective is what gives the agent the ability to look beyond what is rewarding in the short run and see the big picture. Rewards are handed out directly by the environment for a given state and action, whereas the value is something that we need to estimate (see the short sketch after this list). Once we have a good estimate of the value function we can derive our policy from it.
  • In some cases we make use of a model of the environment, which allows us to make explicit predictions about future states. Models allow the agent to perform planning. If an RL method makes use of a model it is referred to as a model-based method; if no model is used it is referred to as a model-free method.
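
To make the long-run perspective concrete: the value of a state is, informally, the expected discounted sum of future rewards, where a discount factor gamma between 0 and 1 controls how much future rewards count compared to immediate ones. The small sketch below just computes such a discounted return for a made-up reward sequence.

def discounted_return(rewards, gamma=0.9):
    # G = r0 + gamma*r1 + gamma^2*r2 + ...
    g = 0.0
    for t, r in enumerate(rewards):
        g += (gamma ** t) * r
    return g

# A reward of 1 far in the future is worth less than a reward of 1 right away
print(discounted_return([0, 0, 0, 1]))  # 0.9^3 = 0.729
print(discounted_return([1, 0, 0, 0]))  # 1.0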

Q-learning

Q-learning is one of the basic RL methods. The central structure in this method is the Q function, which gives the value of different actions in a certain state (see Module2 - MDPs and POMDPs). It can be represented as a matrix with states in the rows and actions in the columns. In Q-learning we initialize the Q matrix as best we can and then let the agent perform actions guided by the Q matrix, updating the values based on the rewards received.
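
The update itself moves Q(s, a) towards r + gamma * max_a' Q(s', a') with a step size alpha. The sketch below applies this rule to a single made-up transition; in practice the update is repeated over and over while the agent interacts with the environment.

# Q[state][action], here 3 states and 2 actions, initialized to zero
Q = [[0.0, 0.0] for _ in range(3)]

alpha = 0.1   # learning rate
gamma = 0.9   # discount factor

# One observed transition (values made up for the example):
# in state 0 the agent took action 1, received reward 1.0 and ended up in state 2
s, a, r, s_next = 0, 1, 1.0, 2

# Q-learning update
td_target = r + gamma * max(Q[s_next])
Q[s][a] += alpha * (td_target - Q[s][a])
print(Q[s][a])  # 0.1 after this single update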

How to use Q Learning in Video Games Easily

 

Deep Reinforcement Learning

A significant part of the work on RL today falls into the category of deep RL, where one or more of the policy, the value function and the model are represented by deep neural networks.
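
As a rough illustration: in deep Q-learning the Q matrix is replaced by a neural network that takes the state as input and outputs one Q-value per action. The sketch below builds such a network with tf.keras, assuming TensorFlow (installed further down on this page) is available; the layer sizes, state dimension and number of actions are chosen arbitrarily and this is not the exact architecture used in the keras-rl example.

from tensorflow import keras

state_dim = 4     # e.g. the four CartPole state variables
num_actions = 2   # e.g. push left / push right

# A small network mapping a state vector to one Q-value per action
q_network = keras.Sequential([
    keras.layers.Dense(16, activation="relu", input_shape=(state_dim,)),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(num_actions, activation="linear"),
])
q_network.compile(optimizer="adam", loss="mse")
q_network.summary()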

Deep Q Learning for Video Games - The Math of Intelligence #9

 

Inverse Reinforcement Learning (IRL)

For some problems it can be hard to define a good reward function; setting one up typically involves a lot of tweaking to get the desired behavior. In some of these cases inverse reinforcement learning (IRL) can be used, where we instead try to infer the reward function by looking at demonstrations of some kind. This is natural if, for example, there already are solutions to the problem, such as humans performing the task. We can then let the system learn from these examples and come up with an approximation of the reward function, typically defined as a function of features of the state of the system. This is also referred to as apprenticeship learning, and one of the most frequently cited early examples is the work at Stanford on a helicopter learning to perform stunt-like manoeuvres.
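
To give a feel for what "a reward function defined on features of the state" can look like: in the simplest, linear case one assumes reward(s) = w · phi(s), where phi(s) is a feature vector and the weight vector w is what IRL tries to recover from the demonstrations, for instance by matching the feature statistics of the demonstrated behavior. The snippet below only shows this parameterization with invented features and weights; it is not a full IRL algorithm.

def features(state):
    # Invented features for illustration, e.g. for a driving-style task:
    # state = (speed, distance to obstacle, lateral offset from lane center)
    speed, dist_to_obstacle, lane_offset = state
    return [speed, -1.0 / (dist_to_obstacle + 0.1), -abs(lane_offset)]

def reward(state, w):
    # Linear reward r(s) = w . phi(s); IRL estimates w from demonstrations
    return sum(wi * fi for wi, fi in zip(w, features(state)))

w = [0.5, 1.0, 2.0]                 # weights that IRL would try to recover
print(reward((10.0, 5.0, 0.2), w))  # reward assigned to one example state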

Autonomous Helicopters Teach Themselves to Fly Stunts

 

OpenAI Gym

To get your hands dirty with some RL we will use the OpenAI Gym. Again, we will get back to RL in the next course, so this is intentionally kept rather brief without going into too much detail. Looking at the examples, you might think "I could have done that easily with a simple model and what I learned in the basic control course". Yes, this is true, but the promise of RL is that general methods can be used to solve a large number of different problems.

 

Installation of OpenAI Gym

In what follows we will assume that you are using the WASP VM used in module1 or the external disk. If you have another (unsupported) setup, follow the instructions at https://github.com/openai/gym.

Open a terminal and run

sudo apt-get install -y python-numpy python-dev cmake zlib1g-dev libjpeg-dev xvfb libav-tools xorg-dev python-opengl libboost-all-dev libsdl2-dev swig
cd ~/software/
git clone https://github.com/openai/gym.git
cd gym
pip install -e .
pip install -e '.[box2d]'
pip install -e '.[classic_control]'

 

Now let us try the CartPole-v1 environment 

cd ~/software/gym
python examples/agents/keyboard_agent.py CartPole-v1

You control the actions manually in this case. The actions are discrete and correspond to pushing left or right, mapped to the keys 0 and 1. For every time step that you keep the pole from leaning too much, you get a reward of 1. An episode is over when the pole leans too much, when the cart moves too far, or after 200 iterations. Can you control it well enough to get 200 consistently?
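
You can also drive the environment from a script instead of the keyboard. The sketch below runs one episode of CartPole-v1 with randomly chosen actions, assuming the gym version installed above (where env.step returns observation, reward, done and info); it is only meant to show the programmatic interface.

import gym

env = gym.make("CartPole-v1")
observation = env.reset()
total_reward, done = 0.0, False
while not done:
    env.render()                        # comment out to run without graphics
    action = env.action_space.sample()  # picks 0 or 1 at random
    observation, reward, done, info = env.step(action)
    total_reward += reward              # +1 for every step the pole stays up
print("Episode finished with total reward", total_reward)
env.close()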

Install keras-rl

OpenAI Gym mainly defines the environments and benchmarks; the RL agents themselves we will download from elsewhere.

cd ~/software/
git clone https://github.com/keras-rl/keras-rl.git
cd keras-rl
pip install keras-rl
pip install h5py
pip install tensorflow

Now let us try one of the agents that implements DQN, which was used by DeepMind to play Atari games (https://arxiv.org/abs/1312.5602).

cd ~/software/keras-rl
python examples/dqn_cartpole.py

You will see how the agent gradually improves and how, after about 20 minutes, it will have learned to balance the pole quite well. You can speed up the learning by disabling the visualisation: look for the line dqn.fit(env, nb_steps=50000, visualize=True, verbose=2) and change visualize=True to visualize=False. The run will then only visualize 5 episodes with the learned agent at the end.

You can also try a traditional (non-deep) RL method called SARSA with 

cd ~/software/keras-rl
python examples/sarsa_cartpole.py
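
SARSA uses an update rule very similar to the one in Q-learning, but it is on-policy: instead of the maximum over the actions in the next state it uses the value of the action the agent actually takes next. A single-transition sketch with made-up values, analogous to the Q-learning sketch above, could look like this:

Q = [[0.0, 0.0] for _ in range(3)]   # Q[state][action]
alpha, gamma = 0.1, 0.9

# A made-up transition (s, a, r, s') plus the action a' actually chosen in s'
s, a, r, s_next, a_next = 0, 1, 1.0, 2, 0

# SARSA update: uses Q[s_next][a_next] instead of max(Q[s_next])
Q[s][a] += alpha * (r + gamma * Q[s_next][a_next] - Q[s][a])
print(Q[s][a])  # 0.1 after this single update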