Introduction to Q-Learning with Python and Open AI Gym (2024)

The code that accompanies this article can be downloaded here.

In the previous article, we got familiar with reinforcement learning and the problem it is trying to solve. Reinforcement learning is the third paradigm, or third type of learning, in the universe of artificial intelligence. The other types, supervised and unsupervised learning, were covered on this site as well, so we decided to write a little bit about this completely different approach. In general, when we talk about reinforcement learning, we are talking about some type of interaction between a self-learning agent and its environment.


The agent is trying to achieve some kind of goal inside the environment while it interacts with it. The whole interaction is divided into time steps. In every time step, the agent performs a certain action, changes the state of the environment and, based on the success of its action, gets a certain reward. This way the agent learns which actions should be performed and which shouldn’t in a given environment state. The agent also associates values with each state. Basically, every state has a prediction of future rewards. Based on this information the agent creates a policy (a mapping from the state of the environment to the desired action) and a value function (an estimate of the long-term reward the agent can expect). This approach is often split into separate episodes, like separate games of chess, where the reward comes only at the end of each episode. The whole process is simplified in the image below.

[Figure: the agent-environment interaction loop]

We also got familiar with the mathematical formalization of the previously mentioned process: Markov Decision Processes (MDPs). This framework gives us a formal description of the problem in which an agent seeks the best policy, defined as the function π. Inside MDPs, we recognize several concepts, like a set of states S, a set of actions A, the expected reward Ra(s, s’) that the agent will get going from one state to another, etc. For more information about MDPs, check this article.

There are two important concepts we need to emphasize. The first one is the value of a state s, Vπ(s), while following policy π. It is defined as the expected reward when starting in the state s:

    Vπ(s) = Eπ[ R_{t+1} + γ R_{t+2} + γ² R_{t+3} + … | S_t = s ]

where γ is the discount factor, which determines how much importance we want to give to future rewards. The second important concept defined within this framework is the value of taking action a in state s under policy π. It is represented as the expected reward that the agent will get if it starts from state s, takes the action a and follows policy π afterwards, qπ(s, a):

    qπ(s, a) = Eπ[ R_{t+1} + γ R_{t+2} + γ² R_{t+3} + … | S_t = s, A_t = a ]
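To make the role of the discount factor γ more concrete, here is a minimal sketch that computes the discounted sum of rewards for a single episode. The reward sequence and the helper name `discounted_return` are made up for illustration; they are not part of the article's code.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * rewards[k] over one episode's reward sequence."""
    total = 0.0
    # Walk backwards so that each step is simply: G = r + gamma * G
    for r in reversed(rewards):
        total = r + gamma * total
    return total


rewards = [-1, -1, -1, 20]  # hypothetical rewards from one short episode
for gamma in (0.0, 0.6, 1.0):
    print(gamma, discounted_return(rewards, gamma))
```

With γ = 0 only the immediate reward counts, while with γ = 1 every future reward counts as much as the first one.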

This is just an introduction to reinforcement learning. If you want to learn more, you can check out our previous article. Anyhow, this is a rough representation of the problem. In this article, we will try to solve this problem by using Q-Learning, which is the simplest form of reinforcement learning.

Temporal-Difference Learning


Q-Learning is part of the so-called tabular solutions to reinforcement learning or, to be more precise, it is one kind of Temporal-Difference algorithm. These types of algorithms don’t model the whole environment; they learn directly from the environment’s dynamics. Apart from that, they update their estimates based on previous estimates, so they don’t wait for the final outcome of the process. The simplest form of Temporal-Difference learning is usually denoted as TD(0). It is given by the formula:

    V(S_t) ← V(S_t) + α [R_{t+1} + γV(S_{t+1}) − V(S_t)]

where α is the learning rate. This means that this approach waits until the next time step t+1 and uses the reward and the estimated value from that time step to update the value of time step t. TD(0) is performed like this:

  1. Initialize the value of each state from the set of states S arbitrarily:
    V(s) = n, ∀s ∈ S.
  2. Pick the action a from the set of actions A(s) defined for that state, as given by the policy π.
  3. Perform action a.
  4. Observe reward R and the next state s’.
  5. Update the value of the state using the formula:
    V(s) ← V(s) + α [R + γV(s’) − V(s)]
  6. Repeat steps 2-5 for each time step until the terminal state is reached.
  7. Repeat steps 2-6 for each episode.
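The TD(0) steps can be sketched on a toy problem. The example below is only an illustration under our own assumptions: a hypothetical five-state random walk in which the agent starts in the middle, moves left or right at random (that is the policy π here), and receives a reward of 1 only when it exits on the right side. The function name and all parameter values are made up.

```python
import random


def td0_random_walk(num_episodes=2000, alpha=0.05, gamma=1.0, seed=0):
    """TD(0) value estimates for a 5-state random walk (hypothetical toy task)."""
    rng = random.Random(seed)
    n_states = 5                         # non-terminal states 0..4
    V = [0.0] * n_states                 # step 1: initialise V(s)
    for _ in range(num_episodes):
        s = n_states // 2                # every episode starts in the middle
        while True:
            step = rng.choice((-1, 1))   # steps 2-3: random policy, go left or right
            s_next = s + step
            # step 4: observe the reward and whether the episode has ended
            done = s_next < 0 or s_next >= n_states
            reward = 1.0 if s_next >= n_states else 0.0
            v_next = 0.0 if done else V[s_next]   # terminal states have value 0
            # step 5: V(s) <- V(s) + alpha * [R + gamma * V(s') - V(s)]
            V[s] += alpha * (reward + gamma * v_next - V[s])
            if done:
                break
            s = s_next
    return V


print([round(v, 2) for v in td0_random_walk()])
```

States closer to the rewarding right exit should end up with higher estimated values than states closer to the left exit.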

As you can see, this approach learns its estimates in part on the basis of other estimates. This is what more experienced artificial intelligence engineers like to call bootstrapping: the algorithm learns a guess from a guess. What does this have to do with Q-Learning? Well, Q-Learning goes one step further than Temporal-Difference Learning. It doesn’t just learn a guess from another guess, it does so regardless of the policy being followed.



As you could see, Temporal-Difference Learning is based on estimates that are built from other estimates. Q-Learning goes one step further: it estimates the aforementioned value q of taking action a in state s under policy π. That is how it got its name. Basically, decisions in this approach are based on estimates for state-action pairs, not for states alone. The simplest form is called one-step Q-Learning and it is defined by the formula:

    Q(S_t, A_t) ← Q(S_t, A_t) + α [R_{t+1} + γ max_a Q(S_{t+1}, a) − Q(S_t, A_t)]

A Q-Value for a particular state-action combination can be seen as the quality of an action taken from that state. As you can see, the policy still determines which state-action pairs are visited and updated, but nothing more. This is why Q-Learning is sometimes referred to as off-policy TD learning.

All these Q-Values are stored inside the Q-Table, which is just a matrix with rows for states and columns for actions. This table is updated by the agent and looks something like this:

[Figure: example Q-Table with one row per state and one column per action]
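A Q-Table of this shape can be sketched with a NumPy array. The sizes and the values below are invented purely for readability; a real table is filled in by the agent during training.

```python
import numpy as np

n_states, n_actions = 4, 3            # tiny hypothetical sizes, for readability
q_table = np.zeros((n_states, n_actions))

# Pretend the agent has already learned a few values
q_table[0] = [0.0, 1.5, -0.2]
q_table[2] = [0.7, 0.1, 0.3]

print(q_table)
# The greedy action for a state is the column with the largest Q-Value in its row
print("Best action in state 0:", int(np.argmax(q_table[0])))
print("Best action in state 2:", int(np.argmax(q_table[2])))
```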

Let’s break Q-Learning down into steps:

  1. Initialize all Q-Values in the Q-Table arbitrarily, and the Q-Value of the terminal state to 0:
    Q(s, a) = n, ∀s ∈ S, ∀a ∈ A(s)
    Q(terminal-state, ·) = 0
  2. Pick the action a from the set of actions A(s) defined for that state, as given by the policy π.
  3. Perform action a.
  4. Observe reward R and the next state s’.
  5. Among all possible actions from the state s’, select the action a’ with the highest Q-Value.
  6. Update the value of the state-action pair using the formula:
    Q(s, a) ← Q(s, a) + α [R + γQ(s’, a’) − Q(s, a)]
  7. Repeat steps 2-6 for each time step until the terminal state is reached.
  8. Repeat steps 2-7 for each episode.
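The update in steps 5 and 6 can be sketched as a small helper. The function name `q_update`, the `done` flag and the default values for alpha and gamma are our own choices for illustration, not something prescribed by the algorithm.

```python
import numpy as np


def q_update(q_table, state, action, reward, next_state, done, alpha=0.1, gamma=0.6):
    """Apply one one-step Q-Learning update in place and return the new value."""
    # Step 5: value of the best action a' in s' (0 if s' is terminal)
    best_next = 0.0 if done else np.max(q_table[next_state])
    # Step 6: Q(s, a) <- Q(s, a) + alpha * [R + gamma * Q(s', a') - Q(s, a)]
    td_error = reward + gamma * best_next - q_table[state, action]
    q_table[state, action] += alpha * td_error
    return q_table[state, action]


q = np.zeros((3, 2))
q[1] = [2.0, 5.0]                     # made-up values for the next state
print(q_update(q, state=0, action=1, reward=-1.0, next_state=1, done=False))
```

Note that the update looks only at the best action in the next state, no matter which action the agent actually takes there. That is the off-policy part.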

After some time and enough random exploration of actions, the Q-Values tend to converge, serving our agent as the action-value function we mentioned previously. The important thing to note is that we usually add one more parameter to keep the agent from locking onto the first route it learns. Basically, we use a value epsilon that defines whether we explore new actions, and maybe come up with a better solution, or go with the already learned route. This parameter defines the balance between exploring new options and exploiting already learned ones. Let’s now see how we can write simple Python code for this approach.
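One common way to use epsilon is the so-called epsilon-greedy rule: with probability epsilon pick a random action, otherwise pick the best known one. The sketch below is a minimal illustration; the helper name `choose_action` and the sample Q-Values are hypothetical.

```python
import random

import numpy as np


def choose_action(q_table, state, epsilon, rng=random):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    n_actions = q_table.shape[1]
    if rng.random() < epsilon:
        return rng.randrange(n_actions)        # explore: uniform random action
    return int(np.argmax(q_table[state]))      # exploit: best known action


q = np.array([[0.0, 2.0, 1.0]])                # one state, three actions
rng = random.Random(42)
picks = [choose_action(q, 0, epsilon=0.1, rng=rng) for _ in range(1000)]
print("Share of greedy picks:", picks.count(1) / len(picks))
```

With epsilon = 0.1 the agent follows the learned route most of the time and only occasionally tries something else.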




In order to run this code, you have to have Python 3 installed on your machine. In this example, we are using Python 3.7. Also, you have to install Open AI Gym or to be more specific Atari Gym. You can install it by running:

pip install gym[atari]

If you are using Windows, it is not this straightforward, so you can follow this article in order to install it properly.

Open AI Gym has its own API and its own way of working. Since that is a separate topic, we will not go in depth into how interaction with the environment is done from code. As we go, we will mention a few points that are important for understanding the code. However, we strongly suggest you check out this article if you are not familiar with the concept and the API of Open AI Gym.


We’ll be using the Gym environment called Taxi-v2. This is a very simple environment, almost like a ‘Hello world’ example. It was introduced to illustrate some issues in hierarchical reinforcement learning. In essence, there are 4 locations in the environment and the goal is to pick up the passenger at one location and drop them off at another. The agent can perform 6 actions (south, north, west, east, pickup, dropoff). You can find more information about it here.


Let’s see how we can solve it using Q-Learning.


We will first import necessary libraries and modules:

import numpy as np
import random
from IPython.display import clear_output
import gym


As you can see, we are importing the numpy module, Python’s module for numerical operations, and the gym module, the Open AI Gym library. Apart from that, we are using random and IPython.display for simple operations. Then we need to create an environment. That is done like this:

enviroment = gym.make("Taxi-v2").env
print('Number of states: {}'.format(enviroment.observation_space.n))
print('Number of actions: {}'.format(enviroment.action_space.n))

We use the make function to instantiate an object of the environment we want. In this example, that is the Taxi-v2 environment. We can display the current state of the environment and the agent with the render method. The important thing is that we can access all states of the environment using the observation_space property and all actions of the environment using action_space. Here we have 500 states and 6 possible actions. Apart from these, the Open AI Gym API has two more methods we need to mention. The first one is the reset method, which resets the environment and returns a random initial state. The other is the step method, which performs an action and steps the environment by one timestep.
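The number 500 comes from the pieces that make up a Taxi state: 25 taxi positions on the 5x5 grid, 5 passenger locations (the 4 named locations plus "inside the taxi") and 4 destinations. The sketch below shows how such components can be packed into a single index. To the best of our knowledge this mirrors the encoding used in the Gym source, but treat the helper names and the exact ordering as an assumption rather than as the library's API.

```python
def encode(taxi_row, taxi_col, passenger_loc, destination):
    """Pack the four state components into a single index in [0, 500)."""
    return ((taxi_row * 5 + taxi_col) * 5 + passenger_loc) * 4 + destination


def decode(index):
    """Inverse of encode: recover the four components from the index."""
    index, destination = divmod(index, 4)
    index, passenger_loc = divmod(index, 5)
    taxi_row, taxi_col = divmod(index, 5)
    return taxi_row, taxi_col, passenger_loc, destination


state = encode(taxi_row=3, taxi_col=1, passenger_loc=2, destination=0)
print(state, decode(state))
print("Total states:", 5 * 5 * 5 * 4)
```

Each row of the Q-Table then corresponds to one such index.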

Now we can proceed with the training of the agent. Let’s first initialize necessary variables:

alpha = 0.1
gamma = 0.6
epsilon = 0.1
q_table = np.zeros([enviroment.observation_space.n, enviroment.action_space.n])


Then we run the training using the steps we described in the previous chapter:

num_of_episodes = 100000

for episode in range(0, num_of_episodes):
    # Reset the environment
    state = enviroment.reset()

    # Initialize variables
    reward = 0
    terminated = False

    while not terminated:
        # Take learned path or explore new actions based on the epsilon
        if random.uniform(0, 1) < epsilon:
            action = enviroment.action_space.sample()
        else:
            action = np.argmax(q_table[state])

        # Take action
        next_state, reward, terminated, info = enviroment.step(action)

        # Recalculate
        q_value = q_table[state, action]
        max_value = np.max(q_table[next_state])
        new_q_value = (1 - alpha) * q_value + alpha * (reward + gamma * max_value)

        # Update Q-table
        q_table[state, action] = new_q_value
        state = next_state

    if (episode + 1) % 100 == 0:
        print("Episode: {}".format(episode + 1))

print("Training is done!\n")


Note the epsilon value that we added in order to balance exploitation and exploration. The rest of the implementation is pretty straightforward. Finally, we can evaluate the model we trained:

total_epochs = 0
total_penalties = 0
num_of_episodes = 100

for _ in range(num_of_episodes):
    state = enviroment.reset()
    epochs = 0
    penalties = 0
    reward = 0
    terminated = False

    while not terminated:
        # Always follow the learned (greedy) policy during evaluation
        action = np.argmax(q_table[state])
        state, reward, terminated, info = enviroment.step(action)

        # A reward of -10 marks an illegal pickup or dropoff
        if reward == -10:
            penalties += 1

        epochs += 1

    total_penalties += penalties
    total_epochs += epochs

print("Epochs per episode: {}".format(total_epochs / num_of_episodes))
print("Penalties per episode: {}".format(total_penalties / num_of_episodes))


We can see that our agent performed without errors, meaning it picked up passengers and dropped them off at the right location 100 times.
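If you don’t have Gym installed, the same train-then-evaluate loop can be tried on a toy problem. The sketch below is not the Taxi environment: `Corridor` is a hypothetical six-cell world we made up, with reset and step methods that only imitate the shape of the Gym calls used in this article. The reward values and hyperparameters are arbitrary choices.

```python
import random

import numpy as np


class Corridor:
    """Hypothetical 1-D world: start at cell 0, goal at the last cell."""

    def __init__(self, length=6):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):                   # 0 = left, 1 = right
        move = 1 if action == 1 else -1
        self.state = min(max(self.state + move, 0), self.length - 1)
        done = self.state == self.length - 1
        reward = 10 if done else -1           # -1 per step, +10 at the goal
        return self.state, reward, done, {}


def train(env, episodes=500, alpha=0.1, gamma=0.6, epsilon=0.1, seed=0):
    """Tabular Q-Learning with an epsilon-greedy behaviour policy."""
    rng = random.Random(seed)
    q_table = np.zeros((env.length, 2))
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            if rng.random() < epsilon:
                action = rng.randrange(2)                  # explore
            else:
                action = int(np.argmax(q_table[state]))    # exploit
            next_state, reward, done, _ = env.step(action)
            target = reward + gamma * np.max(q_table[next_state])
            q_table[state, action] += alpha * (target - q_table[state, action])
            state = next_state
    return q_table


def evaluate(env, q_table, max_steps=100):
    """Follow the greedy policy once and return the number of steps taken."""
    state, done, steps = env.reset(), False, 0
    while not done and steps < max_steps:
        state, _, done, _ = env.step(int(np.argmax(q_table[state])))
        steps += 1
    return steps


env = Corridor()
q_table = train(env)
print("Steps needed by the greedy policy:", evaluate(env, q_table))
```

The shortest route in this corridor takes five steps to the right, so a well-trained table should lead the greedy policy straight to the goal.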


In this article, we got a chance to see how the simplest form of reinforcement learning, Q-Learning, works. We explored how it all started and some of the math behind it. Apart from that, we had a chance to implement one simple example of Q-Learning using Python and Open AI Gym. The problem with Q-Learning is that as the number of states grows, it becomes difficult to store them all in a Q-Table. Today we are using an approach called Deep Q-Learning, which uses neural networks instead of the Q-Table, but that is a topic for another article.

Thank you for reading!

Read more posts from the author at Rubik’s Code.


Author: Jamar Nader
