An artificial intelligence agent that learns from its own mistakes until it can handle a certain task like an expert? To many this may sound like science fiction, but it is based on a simple principle called Reinforcement Learning.
The Cart Pole Problem
Recently, I found the OpenAI Gym and started playing with some of the environments. It certainly is a nice way of getting your head off kaggle.com for a while. This is the start of a series of posts describing solutions to some of the problems posted there.
As suggested on the Getting Started page, I got my hands on one of the easier problems, called CartPole-v0. Basically, you have to balance a pole on a cart. Each time frame you have to choose one of two “actions”, [1;-1], and thereby push the cart either left or right. (Note that the actual action set is [0;1].)
The first problem you have to solve is figuring out how to structure the data. Obviously, your input data should contain the four observations. Interestingly enough, since we are solving this problem by applying supervised learning, the semantics of this data are not important (a black-box approach). The tricky part is what comes next. You add the action [0;1] taken based on these observations as a fifth input variable. Deciding on how to represent the output variable is probably even trickier. As an output variable you take the count of time frames it takes for the episode to finish - either the pole falls on its side or you reach the maximum of 200 time frames. Ok, let’s start by defining a simple class called Cache:
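A minimal sketch of such a Cache class, assuming pandas for the data frame; the column names and implementation details are my own and may differ from the original listing:

```python
import pandas as pd


class Cache:
    """Short-term memory: stores the data of a single episode."""

    def __init__(self):
        self.rows = []

    def cache_data(self, observation, action, time_frame):
        # Store the four observations, the action taken and the frame index.
        self.rows.append(list(observation) + [action, time_frame])

    def get_frame(self):
        columns = ["obs0", "obs1", "obs2", "obs3", "action", "time_frame"]
        df = pd.DataFrame(self.rows, columns=columns)
        # The output variable: how many frames remain until the episode ends.
        df["future_reward"] = len(df) - df["time_frame"]
        return df.drop(columns=["time_frame"])
```

For example, in an episode that lasted two frames, the first frame gets a `future_reward` of 2 and the second a `future_reward` of 1.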
As the episode runs, cache_data is called for each time frame to store the observation, the action taken and the time frame index. At the end of the episode, get_frame creates a data frame - the valuable piece of data that is later to be learned by a model. Notice the transformation of the output variable (here called future_reward) into the count of time frames it takes for the episode to finish. Next, we create a class Memory:
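A sketch of what Memory might look like, again assuming pandas; the original implementation may differ:

```python
import pandas as pd


class Memory:
    """Long-term memory: accumulates the data frames of all episodes."""

    def __init__(self):
        self.data = pd.DataFrame()

    def add(self, episode_frame):
        # Append one episode's short-term cache to the long-term store.
        self.data = pd.concat([self.data, episode_frame], ignore_index=True)
```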
The Memory class holds all the data that our “AI agent” is going to use when learning. After each episode the “cache” or the short-term memory is added to the “memory” or the long-term memory. The last piece of the puzzle is adding the brain:
Putting it all together
After each episode the ‘train’ function is called and a model is fitted to the data collected so far. I won’t go into details, as there is plenty of material online on xgboost and other learning algorithms. Still, it took me a little more than a couple of hours to fine-tune xgboost to perform well. Besides learning, the brain also has to decide on an action based on an observation. For the first few episodes, the brain behaves randomly. Afterward, it gradually switches to fully conscious decisions by using the regression model. Basically, the regression model tries to predict which one of the two actions will lead to a higher count of time frames before the episode ends. The whole code is posted below, feel free to reproduce it. This solution did quite well, solving the environment after 15 episodes and only 9 seconds. You can see the behavior of the cart pole in the video below:
And here is the complete source code for the cart pole solution: