A Hands-On Introduction to Deep Q-Learning using OpenAI Gym in Python (2024)


I have always been fascinated with games. The seemingly infinite options available to perform an action under a tight timeline – it’s a thrilling experience. There’s nothing quite like it.

So when I read about the incredible algorithms DeepMind was coming up with (like AlphaGo and AlphaStar), I was hooked. I wanted to learn how to make these systems on my own machine. And that led me into the world of deep reinforcement learning (Deep RL).

Deep RL is relevant even if you’re not into gaming. Just check out the sheer variety of functions currently using Deep RL for research:

[Figure: areas currently using Deep RL for research]

What about industry-ready applications? Well, here are two of the most commonly cited Deep RL use cases:

  • Google’s Cloud AutoML
  • Facebook’s Horizon Platform

The scope of Deep RL is IMMENSE. This is a great time to enter into this field and make a career out of it.

In this article, I aim to help you take your first steps into the world of deep reinforcement learning. We’ll use one of the most popular algorithms in RL, deep Q-learning, to understand how deep RL works. And the icing on the cake? We will implement all our learning in an awesome case study using Python.

Table of contents

  • Introduction
  • The Road to Q-Learning
    • RL Agent-Environment
    • Markov Decision Process (MDP)
    • Q Learning
  • What is Deep Q-Learning?
  • Why ‘Deep’ Q-Learning?
  • Deep Q-Networks
  • Challenges in Deep RL as Compared to Deep Learning
    • 1. Target Network
    • 2. Experience Replay
  • Putting it all Together
  • Frequently Asked Questions
  • End Notes

The Road to Q-Learning

There are certain concepts you should be aware of before wading into the depths of deep reinforcement learning. Don’t worry, I’ve got you covered.

I have previously written various articles on the nuts and bolts of reinforcement learning to introduce concepts like multi-armed bandit, dynamic programming, Monte Carlo learning and temporal differencing. I recommend going through these guides in the below sequence:

  • Reinforcement Learning Guide: Solving the Multi-Armed Bandit Problem from Scratch in Python
  • Reinforcement Learning: Introduction to Monte Carlo Learning using the OpenAI Gym Toolkit
  • Introduction to Monte Carlo Tree Search: The Game-Changing Algorithm behind DeepMind’s AlphaGo
  • Nuts and Bolts of Reinforcement Learning: Introduction to Temporal Difference (TD) Learning

These articles are good enough for getting a detailed overview of basic RL from the beginning.

However, note that the articles linked above are in no way prerequisites for the reader to understand Deep Q-Learning. We will do a quick recap of the basic RL concepts before exploring what is deep Q-Learning and its implementation details.

RL Agent-Environment

A reinforcement learning task is about training an agent which interacts with its environment. The agent arrives at different scenarios known as states by performing actions. Actions lead to rewards, which can be positive or negative.

The agent has only one purpose here: to maximize its total reward across an episode. An episode is everything that happens between the first state and the last or terminal state within the environment. We reinforce the agent to learn to perform the best actions by experience. This is the strategy or policy.


Let’s take an example of the ultra-popular PUBG game:

  • The soldier is the agent here interacting with the environment
  • The states are exactly what we see on the screen
  • An episode is a complete game
  • The actions are moving forward, backward, left, right, jump, duck, shoot, etc.
  • Rewards are defined on the basis of the outcome of these actions. If the soldier is able to kill an enemy, that calls for a positive reward while getting shot by an enemy is a negative reward

Now, in order to kill that enemy or get a positive reward, there is a sequence of actions required. This is where the concept of delayed or postponed reward comes into play. The crux of RL is learning to perform these sequences and maximizing the reward.
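To make the agent, state, action, reward and episode vocabulary concrete, here is a minimal sketch of the interaction loop. The `CorridorEnv` class and its reward values are made up for illustration (they are not part of OpenAI Gym or of PUBG); only the `reset`/`step` shape mirrors how such environments are usually exposed.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: walk right along a corridor to reach the goal."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position  # the state is simply the agent's position

    def step(self, action):
        # action 0 = move left, action 1 = move right
        move = 1 if action == 1 else -1
        self.position = max(0, min(self.length - 1, self.position + move))
        done = self.position == self.length - 1
        reward = 1.0 if done else -0.1  # small penalty per step, payoff at the goal
        return self.position, reward, done


def run_episode(env, policy, max_steps=100):
    """Play one episode and return the total reward collected."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)
        state, reward, done = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward


random.seed(0)
random_return = run_episode(CorridorEnv(), lambda s: random.choice([0, 1]))
always_right_return = run_episode(CorridorEnv(), lambda s: 1)
print(random_return, always_right_return)
```

The positive reward only arrives at the end of the corridor, so the agent has to string several actions together before it pays off. That is the delayed reward idea in miniature.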

Markov Decision Process (MDP)

An important point to note: each state within an environment is a consequence of its previous state, which in turn is a result of its previous state. However, storing all this information, even for environments with short episodes, quickly becomes infeasible.

To resolve this, we assume that each state follows a Markov property, i.e., each state depends solely on the previous state and the transition from that state to the current state. Check out the below maze to better understand the intuition behind how this works:

[Figure: a maze with two different starting points whose paths lead to the same red penultimate state]

Here there are two scenarios with two different starting points, and the agent traverses different paths to reach the same penultimate state. It does not matter which path the agent takes to reach the red state: the next step to exit the maze and reach the last state is to go right. Clearly, we only needed the information on the red/penultimate state to find the next best action, which is exactly what the Markov property implies.

Q Learning

Let’s say we know the expected reward of each action at every step. This would essentially be like a cheat sheet for the agent! Our agent will know exactly which action to perform.

It will perform the sequence of actions that will eventually generate the maximum total reward. This total reward is also called the Q-value and we will formalise our strategy as:

$$Q(s, a) = r(s, a) + \gamma \max_{a'} Q(s', a')$$

The above equation states that the Q-value yielded from being at state s and performing action a is the immediate reward r(s,a) plus the highest Q-value possible from the next state s’. Gamma here is the discount factor which controls the contribution of rewards further in the future.

Q(s’,a) again depends on Q(s”,a) which will then have a coefficient of gamma squared. So, the Q-value depends on Q-values of future states as shown here:

$$Q(s, a) \rightarrow \gamma Q(s', a) + \gamma^2 Q(s'', a) + \dots + \gamma^n Q(s^{(n)}, a)$$

Adjusting the value of gamma will diminish or increase the contribution of future rewards.
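A quick way to see the effect of gamma is to compute the discounted total of the same reward sequence for a few values. This is a small sketch; the reward list is made up for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the reward k steps ahead is weighted by gamma**k."""
    total = 0.0
    for k, reward in enumerate(rewards):
        total += (gamma ** k) * reward
    return total


rewards = [0.0, 0.0, 0.0, 10.0]  # hypothetical episode: the payoff arrives only at the end
for gamma in (0.0, 0.5, 0.9, 1.0):
    print(gamma, discounted_return(rewards, gamma))
```

With gamma close to 0 the late payoff barely counts, while with gamma equal to 1 it counts in full.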

Since this is a recursive equation, we can start by making arbitrary assumptions for all Q-values. With experience, they will converge toward the optimal values, and with them the optimal policy. In practical situations, this is implemented as an update:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r(s, a) + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$

where alpha is the learning rate or step size. This simply determines to what extent newly acquired information overrides old information.
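Here is a minimal sketch of that update rule on a tiny, made-up chain of five states (the environment, the reward of 1 at the right end and the hyperparameter values are assumptions for illustration). The agent explores at random, and the table still learns that moving right is best.

```python
import random

N_STATES, ACTIONS = 5, (0, 1)   # hypothetical chain; action 0 = left, 1 = right
ALPHA, GAMMA = 0.1, 0.9


def step(state, action):
    """Move along the chain; reaching the last state ends the episode with reward 1."""
    next_state = max(0, min(N_STATES - 1, state + (1 if action == 1 else -1)))
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


random.seed(0)
Q = [[0.0, 0.0] for _ in range(N_STATES)]  # arbitrary initial guess: all zeros

for _ in range(500):
    state, done = 0, False
    while not done:
        action = random.choice(ACTIONS)  # explore at random
        next_state, reward, done = step(state, action)
        # Target: immediate reward plus the discounted best Q-value of the next state
        target = reward + (0.0 if done else GAMMA * max(Q[next_state]))
        Q[state][action] += ALPHA * (target - Q[state][action])
        state = next_state

# Greedy action in every non-terminal state
greedy_policy = [max(ACTIONS, key=lambda a: Q[s][a]) for s in range(N_STATES - 1)]
print(greedy_policy)
```

Note that alpha only blends the old estimate with the new target. Setting it to 1 would overwrite the old value entirely.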

What is Deep Q-Learning?

Deep Q-Learning is a reinforcement learning technique that combines Q-Learning, an algorithm for learning optimal actions in an environment, with deep neural networks. It aims to enable agents to learn optimal actions in complex, high-dimensional environments. By using a neural network to approximate the Q-function, which estimates the expected cumulative reward for each action in a given state, Deep Q-Learning can handle environments with large state spaces. The network is updated iteratively through episodes, using a combination of exploration and exploitation strategies. However, care must be taken to mitigate instability caused by non-stationarity and divergence issues, typically addressed by experience replay and target networks. Deep Q-Learning has proven effective in training agents for various tasks, including video games and robotic control.

Why ‘Deep’ Q-Learning?

Q-learning is a simple yet quite powerful algorithm to create a cheat sheet for our agent. This helps the agent figure out exactly which action to perform.

But what if this cheatsheet is too long? Imagine an environment with 10,000 states and 1,000 actions per state. This would create a table of 10 million cells. Things will quickly get out of control!

It is pretty clear that we can’t infer the Q-value of new states from already explored states. This presents two problems:

  • First, the amount of memory required to save and update that table would increase as the number of states increases
  • Second, the amount of time required to explore each state to create the required Q-table would be unrealistic
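The table size in the example above is easy to check. The 4 bytes per entry is my assumption (a float32 per Q-value).

```python
n_states, n_actions = 10_000, 1_000
cells = n_states * n_actions   # one Q-value per (state, action) pair
megabytes = cells * 4 / 1e6    # assuming 4 bytes (float32) per Q-value
print(cells, megabytes)
```

Memory grows linearly with the number of states, and every one of those cells has to be visited, usually many times, before its value means anything.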

Here’s a thought – what if we approximate these Q-values with machine learning models such as a neural network? Well, this was the idea behind DeepMind’s algorithm that led to its acquisition by Google for 500 million dollars!

Deep Q-Networks

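In deep Q-learning, we use a neural network to approximate the Q-value function. The state is given as the input and the Q-values of all possible actions are generated as the output. The comparison between Q-learning and deep Q-learning is illustrated below:

[Figure: Q-learning looks up a single Q-value for a state-action pair in a table; deep Q-learning feeds the state to a network that outputs a Q-value for every action]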
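The "state in, one Q-value per action out" contract can be sketched with plain NumPy. The layer sizes, the random weights and the example state below are assumptions chosen for illustration, not a trained model.

```python
import numpy as np

STATE_DIM, HIDDEN, N_ACTIONS = 4, 16, 2  # assumed sizes, for illustration only

rng = np.random.default_rng(0)
params = {
    "W1": rng.normal(0.0, 0.1, size=(STATE_DIM, HIDDEN)),
    "b1": np.zeros(HIDDEN),
    "W2": rng.normal(0.0, 0.1, size=(HIDDEN, N_ACTIONS)),
    "b2": np.zeros(N_ACTIONS),
}


def q_values(params, states):
    """Forward pass: a batch of states in, one Q-value per action out."""
    hidden = np.maximum(0.0, states @ params["W1"] + params["b1"])  # ReLU layer
    return hidden @ params["W2"] + params["b2"]                     # linear output


state = np.array([[0.02, -0.01, 0.03, 0.04]])  # one made-up 4-number state
q = q_values(params, state)
greedy_action = int(np.argmax(q, axis=1)[0])
print(q.shape, greedy_action)
```

A single forward pass gives the Q-values of every action, so picking the best action is just an argmax over the output.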

So, what are the steps involved in reinforcement learning using deep Q-learning networks (DQNs)?

  1. All the past experience is stored by the user in memory
  2. The next action is determined by the maximum output of the Q-network
  3. The loss function here is the mean squared error between the predicted Q-value and the target Q-value, Q*. This is basically a regression problem. However, we do not know the target or actual value here, as we are dealing with a reinforcement learning problem. Going back to the Q-value update equation derived from the Bellman equation, we have:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r(s, a) + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$

The term $r(s, a) + \gamma \max_{a'} Q(s', a')$ (highlighted in green in the original figure) represents the target. We can argue that the network is predicting its own value, but since r is the unbiased true reward, the network is going to update its weights using backpropagation and finally converge.

Challenges in Deep RL as Compared to Deep Learning

So far, this all looks great. We understood how neural networks can help the agent learn the best actions. However, there is a challenge when we compare deep RL to deep learning (DL):

  • Non-stationary or unstable target: Let us go back to the pseudocode for deep Q-learning:
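The core of that loop can be sketched as follows. This is my own minimal stand-in, not the original pseudocode: it uses a linear "network" and a single made-up transition, and it recomputes the target from the same weights that are being updated.

```python
import numpy as np

GAMMA, LR = 0.99, 0.1
W = np.zeros((4, 2))  # linear stand-in for the Q-network: Q(s, .) = s @ W


def q_values(weights, state):
    return state @ weights


# One made-up transition (s, a, r, s')
s = np.array([1.0, 0.0, 0.5, -0.5])
a, r = 1, 1.0
s_next = np.array([0.9, 0.1, 0.4, -0.4])

targets = []
for _ in range(3):
    # The target is computed from the SAME weights we are about to change
    target = r + GAMMA * np.max(q_values(W, s_next))
    targets.append(float(target))
    td_error = target - q_values(W, s)[a]
    W[:, a] += LR * td_error * s  # gradient step on the squared error for action a

print(targets)
```

The three printed targets differ even though the transition is identical each time, because each weight update also shifts the value the network is chasing.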

As you can see in the above code, the target is continuously changing with each iteration. In deep learning, the target variable does not change and hence the training is stable, which is just not true for RL.

To summarise, we often depend on the policy or value functions in reinforcement learning to sample actions. However, this is frequently changing as we continuously learn what to explore. As we play out the game, we get to know more about the ground truth values of states and actions and hence, the output is also changing.

So, we try to learn a mapping for a constantly changing input and output. But then what is the solution?

1. Target Network

Since the same network is calculating the predicted value and the target value, there could be a lot of divergence between these two. So, instead of using one neural network for learning, we can use two.

We could use a separate network to estimate the target. This target network has the same architecture as the function approximator but with frozen parameters. For every C iterations (a hyperparameter), the parameters from the prediction network are copied to the target network. This leads to more stable training because it keeps the target function fixed (for a while):

[Figure: the prediction network and the target network, with parameters copied across every C iterations]
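The copy-every-C-iterations idea can be sketched like this. The value of `C`, the weight shapes and the random "gradient step" are placeholders I chose for illustration.

```python
import copy

import numpy as np

C = 4  # assumed sync interval (a hyperparameter)
rng = np.random.default_rng(0)
online = {"W": rng.normal(size=(4, 2))}  # prediction network parameters
target = copy.deepcopy(online)           # frozen copy, used only to compute targets

sync_steps = []
for step in range(1, 11):
    # Stand-in for a gradient update: only the prediction network changes
    online["W"] -= 0.01 * rng.normal(size=online["W"].shape)
    if step % C == 0:
        target = copy.deepcopy(online)   # copy prediction -> target every C iterations
        sync_steps.append(step)

print(sync_steps)
```

Between two syncs the target network does not move, so the regression target stays fixed for those C iterations.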

2. Experience Replay

To perform experience replay, we store the agent’s experiences: $e_t = (s_t, a_t, r_t, s_{t+1})$

What does the above statement mean? Instead of running Q-learning on state/action pairs as they occur during simulation or the actual experience, the system stores the data discovered for [state, action, reward, next_state] – in a large table.

Let’s understand this using an example.

Suppose we are trying to build a video game bot where each frame of the game represents a different state. During training, we could sample a random batch of 64 frames from the last 100,000 frames to train our network. This would get us a subset within which the correlation amongst the samples is low and will also provide better sampling efficiency.
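A replay buffer along these lines can be sketched with a bounded `deque`. The class name and the toy integer "states" are mine; the small capacity in the usage lines is only there to show old experiences being dropped.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state) transitions."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions drop out automatically

    def add(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size=64):
        # Uniform random sampling breaks up the correlation between consecutive frames
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)


random.seed(0)
buffer = ReplayBuffer(capacity=100)
for t in range(150):
    buffer.add(state=t, action=t % 2, reward=0.0, next_state=t + 1)
batch = buffer.sample(batch_size=64)
print(len(buffer), len(batch))
```

Only the most recent 100 transitions survive here, and the 64 sampled ones come from all over that window rather than from consecutive steps.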

Putting it all Together

The concepts we have learned so far? They all combine to make the deep Q-learning algorithm that was used to achieve human-level performance in Atari games (using just the video frames of the game).


I have listed the steps involved in a deep Q-network (DQN) below:

  1. Preprocess and feed the game screen (state s) to our DQN, which will return the Q-values of all possible actions in the state
  2. Select an action using the epsilon-greedy policy. With the probability epsilon, we select a random action a and with probability 1-epsilon, we select an action that has a maximum Q-value, such as a = argmax(Q(s,a,w))
  3. Perform this action in a state s and move to a new state s’ to receive a reward. This state s’ is the preprocessed image of the next game screen. We store this transition in our replay buffer as <s,a,r,s’>
  4. Next, sample some random batches of transitions from the replay buffer and calculate the loss
  5. The loss is given by $\text{Loss} = \left( r + \gamma \max_{a'} Q(s', a'; w^{-}) - Q(s, a; w) \right)^2$, where w are the actual network parameters and $w^{-}$ the target network parameters. This is just the squared difference between target Q and predicted Q
  6. Perform gradient descent with respect to our actual network parameters in order to minimize this loss
  7. After every C iterations, copy our actual network weights to the target network weights
  8. Repeat these steps for M number of episodes
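Steps 2 and 5 above can be sketched in a few lines of NumPy. The function names, the example batch and the `dones` mask (which zeroes the future term on terminal transitions) are my additions for illustration; the listed steps do not spell out terminal handling.

```python
import numpy as np

GAMMA = 0.99
rng = np.random.default_rng(0)


def epsilon_greedy(q_row, epsilon, rng):
    """With probability epsilon pick a random action, otherwise the argmax."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_row)))
    return int(np.argmax(q_row))


def dqn_loss(q_pred, q_next_target, actions, rewards, dones):
    """Mean squared difference between target Q and predicted Q over a batch."""
    chosen = q_pred[np.arange(len(actions)), actions]  # Q(s, a; w) for the taken actions
    # Target uses the target network's Q-values for s'; no future term if the episode ended
    targets = rewards + GAMMA * (1.0 - dones) * q_next_target.max(axis=1)
    return float(np.mean((targets - chosen) ** 2))


# A made-up batch of two transitions
q_pred = np.array([[1.0, 2.0], [0.5, 0.0]])         # actual network output for s
q_next_target = np.array([[0.0, 1.0], [3.0, 2.0]])  # target network output for s'
actions = np.array([1, 0])
rewards = np.array([1.0, 0.0])
dones = np.array([0.0, 1.0])

loss = dqn_loss(q_pred, q_next_target, actions, rewards, dones)
action = epsilon_greedy(q_pred[0], epsilon=0.1, rng=rng)
print(loss, action)
```

In a full agent, the gradient of this loss is taken with respect to the actual network only. The target network's output is treated as a constant.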

Implementing Deep Q-Learning in Python using Keras & OpenAI Gym

Alright, so we have a solid grasp on the theoretical aspects of deep Q-learning. How about seeing it in action now? That’s right – let’s fire up our Python notebooks!

We will make an agent that can play a game called CartPole. We can also use an Atari game, but training an agent to play that takes a while (from a few hours to a day). The idea behind our approach will remain the same, so you can try this on an Atari game on your machine.

[Animation: the CartPole environment]

CartPole is one of the simplest environments in the OpenAI gym (a game simulator). As you can see in the above animation, the goal of CartPole is to balance a pole that’s connected with one joint on top of a moving cart.

Instead of pixel information, there are four kinds of information given by the state (such as the angle of the pole and position of the cart). An agent can move the cart by performing a series of actions of 0 or 1, pushing the cart left or right.

We will use the keras-rl library here, which lets us implement deep Q-learning out of the box.

Step 1: Install keras-rl library

From the terminal, run the following code block:

    git clone https://github.com/matthiasplappert/keras-rl.git
    cd keras-rl
    python setup.py install

Step 2: Install dependencies for the CartPole environment

Assuming you have pip installed, you need to install the following libraries:

    pip install h5py
    pip install gym

Step 3: Let’s get started!

First, we have to import the necessary modules:

    import numpy as np
    import gym

    from keras.models import Sequential
    from keras.layers import Dense, Activation, Flatten
    from keras.optimizers import Adam

    from rl.agents.dqn import DQNAgent
    from rl.policy import EpsGreedyQPolicy
    from rl.memory import SequentialMemory

Then, set the relevant variables:

    ENV_NAME = 'CartPole-v0'

    # Get the environment and extract the number of actions available in the Cartpole problem
    env = gym.make(ENV_NAME)
    np.random.seed(123)
    env.seed(123)
    nb_actions = env.action_space.n

Next, we will build a very simple single hidden layer neural network model:

    model = Sequential()
    model.add(Flatten(input_shape=(1,) + env.observation_space.shape))
    model.add(Dense(16))
    model.add(Activation('relu'))
    model.add(Dense(nb_actions))
    model.add(Activation('linear'))
    print(model.summary())

Now, configure and compile our agent. We will set our policy as Epsilon Greedy and our memory as Sequential Memory because we want to store the result of actions we performed and the rewards we get for each action.

    policy = EpsGreedyQPolicy()
    memory = SequentialMemory(limit=50000, window_length=1)
    dqn = DQNAgent(model=model, nb_actions=nb_actions, memory=memory, nb_steps_warmup=10,
                   target_model_update=1e-2, policy=policy)
    dqn.compile(Adam(lr=1e-3), metrics=['mae'])

    # Okay, now it's time to learn something! We visualize the training here for show,
    # but this slows down training quite a lot.
    dqn.fit(env, nb_steps=5000, visualize=True, verbose=2)
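A note on `target_model_update=1e-2`: as far as I understand keras-rl, a value below 1 does not mean "copy every C steps". Instead it blends a small fraction of the prediction network into the target network at every step (a soft update). The NumPy sketch below shows that blending rule under this assumption; the weight vectors are made up, and this is not keras-rl's own code.

```python
import numpy as np

TAU = 1e-2  # plays the role of target_model_update in the agent configuration above

online_w = np.ones(3)   # stand-in for the prediction network weights
target_w = np.zeros(3)  # stand-in for the target network weights

for _ in range(100):
    # Soft update: move the target a small step toward the prediction network
    target_w = TAU * online_w + (1.0 - TAU) * target_w

print(target_w)
```

After 100 steps the target weights have only covered part of the distance to the prediction weights, which keeps the target slowly moving rather than jumping.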

Test our reinforcement learning model:

dqn.test(env, nb_episodes=5, visualize=True)

This will be the output of our model:

[Figure: output of the trained model on CartPole]

Not bad! Congratulations on building your very first deep Q-learning model. 🙂

Frequently Asked Questions

Q1. What is the difference between deep Q-learning and regular Q-learning?

A. The key difference between deep Q-learning and regular Q-learning lies in their approaches to function approximation. Regular Q-learning uses a table to store Q-values for each state-action pair, making it suitable for discrete state and action spaces. In contrast, deep Q-learning employs a deep neural network to approximate Q-values, enabling it to handle continuous and high-dimensional state spaces. While regular (tabular) Q-learning is guaranteed to converge under standard conditions (every state-action pair is visited infinitely often and the learning rate decays suitably), deep Q-learning’s convergence is less assured due to non-stationarity issues caused by updates to the neural network during learning. Techniques like experience replay and target networks are used to stabilize deep Q-learning training.

Q2. What are the limitations of a deep Q-network?

A. Deep Q-Networks (DQN) come with several limitations. They can suffer from instability during training due to the non-stationarity problem caused by frequent updates of the neural network. Additionally, DQNs might overestimate Q-values, impacting the learning process. They struggle with handling continuous action spaces and can be computationally expensive, requiring significant training time and resources. Exploration in high-dimensional state spaces can be challenging, leading to suboptimal policies. Lastly, tuning hyperparameters for DQNs can be intricate and sensitive, affecting convergence and overall performance. Despite these limitations, techniques like Double Q-learning and prioritized experience replay aim to address some of these challenges.

End Notes

OpenAI Gym provides several environments for using DQN on Atari games. Those who have worked on computer vision problems might understand this intuitively: since the input is the raw frames of the game at each time step, the model is built on a convolutional neural network architecture.

There are some more advanced Deep RL techniques, such as Double DQN, Dueling DQN and Prioritized Experience Replay, which can further improve the learning process. These techniques give us better scores using even fewer episodes. I will be covering these concepts in future articles.

I encourage you to try the DQN algorithm on at least one environment other than CartPole to practice and understand how you can tune the model to get the best results.


Ankit Choudhary, 21 Aug, 2023

IIT Bombay graduate with a Masters and Bachelors in Electrical Engineering. I have previously worked as a lead decision scientist for Indian National Congress, deploying statistical models (Segmentation, K-Nearest Neighbours) to help party leadership/team make data-driven decisions. My interest lies in putting data at the heart of business for data-driven decision making.

