Q-Learning Algorithm: From Explanation to Implementation

In today’s Medium post, I will show you how to implement the Q-Learning algorithm. Before that, I will explain the idea behind Q-Learning and its limitations. This post assumes some Reinforcement Learning (RL) basics; if you need a refresher, please check my previous post about the intuition and the key math behind RL.

Well, let’s recall some definitions and equations that we need for implementing the Q-Learning algorithm.

In RL, we have an environment that we want to learn about. To do that, we build an agent who interacts with the environment through a trial-and-error process. At each time step t, the agent is in a state s_t and chooses an action a_t to perform. The environment runs the selected action and returns a reward to the agent: the higher the reward, the better the action. The environment also tells the agent whether he is done or not. An episode can therefore be represented as a sequence of state-action-reward transitions.

[Figure: the agent-environment interaction loop]

The goal of the agent is to maximize the total reward he gets from the environment. The function to maximize is called the expected discounted return, which we denote G.

G_t = R_{t+1} + γ·R_{t+2} + γ²·R_{t+3} + … = Σ_{k=0}^{∞} γ^k · R_{t+k+1}

To do so, the agent needs to find an optimal policy 𝜋, which is a probability distribution over actions for a given state.

Under the optimal policy, the Bellman Optimality Equation is satisfied:

q*(s, a) = E[ R_{t+1} + γ · max_{a'} q*(s_{t+1}, a') | s_t = s, a_t = a ]

where q* is the optimal Action-Value function, also called the Q-Value function.

All these functions are explained in my previous post.

In the Q-Learning algorithm, the goal is to iteratively learn the optimal Q-Value function using the Bellman Optimality Equation. To do so, we store all the Q-values in a table that we update at each time step using the Q-Learning iteration:

Q(s_t, a_t) ← Q(s_t, a_t) + α · [ r_{t+1} + γ · max_a Q(s_{t+1}, a) − Q(s_t, a_t) ]

where α is the learning rate, an important hyperparameter that we need to tune since it controls the speed of convergence.
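To make the update concrete, here is a toy illustration in Python (the table size, the transition and the hyperparameter values below are made up for the example; the real FrozenLake implementation follows later in this post):

import numpy as np

# A tiny Q-table: 4 states, 2 actions, initialized to 0
Q = np.zeros((4, 2))

alpha, gamma = 0.1, 0.99        # learning rate and discount factor
s, a, r, s_next = 0, 1, 1.0, 2  # one observed (state, action, reward, next state) transition

# Q-Learning iteration: move Q(s, a) toward the Bellman target r + gamma * max_a' Q(s', a')
Q[s, a] = Q[s, a] + alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
print(Q[s, a])  # 0.1 after this first update, since the target is 1.0 and Q started at 0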

Now we could start implementing the Q-Learning algorithm, but first we need to talk about the exploration-exploitation trade-off. Why? At the beginning, the agent has no idea about the environment, so he is more likely to explore new things than to exploit knowledge he does not yet have. As the time steps go by, the agent gathers more and more information about how the environment works, and he then becomes more likely to exploit his knowledge than to explore new things. If we skip this important step, the Q-Value function may converge to a suboptimal solution which, most of the time, is far from the optimal Q-Value function. To handle this, we will use an exploration threshold that decays every episode following an exponential decay formula. At every time step t, we sample a variable uniformly over [0, 1]: if the variable is smaller than the threshold, the agent explores the environment; otherwise, he exploits his knowledge.

threshold_n = N_0 · e^(−λ·n)

where N_0 is the initial value, λ is the decay constant, and n is the episode number.

Below is an example of the exponential decay:

[Figure: example of an exponentially decaying exploration threshold]
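For illustration, the curve above can be reproduced with a few lines of Python (N_0 and λ below are example values, not necessarily the ones we will use for training):

import numpy as np

N_0 = 1.0      # initial exploration threshold
lam = 0.001    # decay constant lambda
episodes = np.arange(10000)
thresholds = N_0 * np.exp(-lam * episodes)
print(thresholds[[0, 1000, 5000, 9999]])  # 1.0, ~0.37, ~0.0067, ~0.000045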

Alright, now we can start coding. Here, we will use the FrozenLake environment from the gym Python library, which provides many environments including Atari games and CartPole.

The FrozenLake environment consists of a 4 by 4 grid representing a frozen lake surface. The agent always starts from state 0, the top-left cell [0,0] of the grid, and his goal is to reach state 15, the bottom-right cell [3,3]. On his way, he walks over frozen surfaces but can also fall into a hole; if he falls, the episode ends. When the agent reaches the goal, the reward is equal to 1. Otherwise, it is equal to 0.

[Figure: the FrozenLake 4 by 4 grid]

First, we import the needed libraries: NumPy for accessing and updating the Q-table, and gym for the FrozenLake environment.

import numpy as np
import gym

Then, we instantiate our environment and get its sizes.

env = gym.make("FrozenLake-v0")  # use "FrozenLake-v1" with recent versions of gym
n_observations = env.observation_space.n  # number of states (16)
n_actions = env.action_space.n            # number of actions (4)

We need to create and initialize the Q-table to 0.

#Initialize the Q-table to 0
Q_table = np.zeros((n_observations,n_actions))
print(Q_table)
[Output: the 16 by 4 Q-table filled with zeros]

We define the different parameters and hyperparameters we talked about earlier in this post.

To evaluate the agent training, we will store the total rewards he gets from the environment after each episode in a list that we will use after the training is finished.
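Here is a minimal sketch of those definitions; the exact values are illustrative choices and should be tuned for your own runs:

# Number of episodes and maximum number of steps per episode
n_episodes = 10000
max_iter_episode = 100

# Exploration threshold: initial value N_0, decay constant lambda and a floor value
exploration_proba = 1.0
exploration_decreasing_decay = 0.001
min_exploration_proba = 0.01

# Discount factor gamma and learning rate alpha
gamma = 0.99
lr = 0.1

# Total reward obtained in each episode, used to evaluate the training afterwards
rewards_per_episode = []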

Now let’s go to the main loop, where the whole training process happens.

Please read all the comments to follow the algorithm.
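Below is a minimal sketch of that loop, written against the classic gym step API (env.step returning the next state, the reward, the done flag and an info dict, as in FrozenLake-v0) and using the names defined above:

for e in range(n_episodes):
    # Reset the environment and start a new episode
    current_state = env.reset()
    done = False
    total_episode_reward = 0

    for i in range(max_iter_episode):
        # Exploration-exploitation trade-off: sample u ~ U(0, 1);
        # explore if u is below the current threshold, otherwise exploit the Q-table
        if np.random.uniform(0, 1) < exploration_proba:
            action = env.action_space.sample()
        else:
            action = np.argmax(Q_table[current_state, :])

        # The environment runs the selected action and returns
        # the next state, the reward and whether the episode is done
        next_state, reward, done, _ = env.step(action)

        # Q-Learning iteration: move Q(s_t, a_t) toward the Bellman target
        Q_table[current_state, action] = Q_table[current_state, action] + lr * (
            reward + gamma * np.max(Q_table[next_state, :]) - Q_table[current_state, action])

        total_episode_reward += reward
        current_state = next_state
        if done:
            break

    # Decay the exploration threshold after each episode (exponential decay)
    exploration_proba = max(min_exploration_proba,
                            np.exp(-exploration_decreasing_decay * e))
    rewards_per_episode.append(total_episode_reward)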

Once our agent is trained, we will test his performance using the rewards-per-episode list. We will do that by computing his mean reward over every 1,000 episodes.
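One simple way to do that evaluation (a sketch, assuming the rewards_per_episode list filled during the training loop above):

print("Mean reward per thousand episodes")
for i in range(10):
    mean_reward = np.mean(rewards_per_episode[1000 * i: 1000 * (i + 1)])
    print(f"Episodes {1000 * i} to {1000 * (i + 1)}: mean reward = {mean_reward:.3f}")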

[Output: mean reward per thousand episodes over the course of training]

As we can see, the agent’s performance is very poor at the beginning, but he improves steadily through training.

The Q-Learning algorithm is a very efficient way for an agent to learn how the environment works. However, when the state space, the action space, or both are continuous, it becomes impossible to store all the Q-values in a table because it would require a huge amount of memory. The agent would also need many more episodes to learn about the environment. As a solution, we can use a Deep Neural Network (DNN) to approximate the Q-Value function, since DNNs are known for their efficiency at approximating functions. This approach is called a Deep Q-Network (DQN), and it will be the topic of my next post.

I hope you understood the Q-Learning algorithm and enjoyed this post.

Thank you!

FAQs

What is the Q-learning algorithm?

Q-learning is a reinforcement learning algorithm that finds an optimal action-selection policy for any finite Markov decision process (MDP). It helps an agent learn to maximize the total reward over time through repeated interactions with the environment, even when the model of that environment is not known.

How do you solve Q-learning problems?

Example Of Q-Learning
  1. Initialize the Q-table: Q = [ [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]], ...
  2. Observe the state: Let's say the agent starts in position (0, 0).
  3. Choose an action: ...
  4. Execute the action: ...
  5. Update the Q-table: ...
  6. Repeat steps 2-5 until the agent reaches the goal state: ...
  7. Repeat steps 1-6 for multiple episodes:

What is the formula for Q-learning?

The equation breaks down as follows: Q(s, a) represents the expected reward for taking action a in state s. The actual reward received for that action is referenced by r while s' refers to the next state. The learning rate is α and γ is the discount factor.

What is the Q* algorithm?

The Q algorithm generates nodes in the search space, applying semantic and syntactic information to direct the search. The use of semantics permits paths to be terminated and fruitful paths to be explored. The paper is restricted to a description of the use of syntactic and semantic information in the Q algorithm.

What is the Q-learning principle?

Q-learning is a technique used to compute an optimal policy for a controlled Markov chain based on observations of the system controlled using a non-optimal policy. It has proven to be effective for models with finite state and action space.

Is Q-learning model-free?

Q-learning is a model-free reinforcement learning algorithm to learn the value of an action in a particular state. It does not require a model of the environment (hence "model-free"), and it can handle problems with stochastic transitions and rewards without requiring adaptations.

What is the alternative to Q-learning?

VA-learning learns off-policy and enjoys similar theoretical guarantees as Q-learning. Thanks to the direct learning of advantage function and value function, VA-learning improves the sample efficiency over Q-learning both in tabular implementations and deep RL agents on Atari-57 games.

Why is Q-learning biased?

The overestimation bias occurs since the target max_{a'∈A} Q(s_{t+1}, a') is used in the Q-learning update. Because Q is an approximation, it is probable that the approximation is higher than the true value for one or more of the actions. The maximum over these estimators, then, is likely to be skewed towards an overestimate.

What are the hyperparameters of Q-learning?

α and γ are the q-learning rate and discount factor, which are both hyperparameters of the Q-Learning algorithm. The q-learning rate determines how quickly new information is learned, while the discount factor determines how much value we assign to short term versus long term rewards.

Does Q-learning always converge?

We show that Q-learning converges to the optimum action-values with probability 1 so long as all actions are repeatedly sampled in all states and the action-values are represented discretely.

Why is Q-learning considered an off-policy control method?

Q-learning is a common example of off-policy RL. Like SARSA, the behavior policy generates random control actions with a small probability. Unlike SARSA however, Q-Learning uses the outcome of this action to separately update the value function for a greedy (target) policy.

What is an algorithm formula?

An algorithm, especially in mathematics, is a step-by-step procedure that can be used to solve computations or other mathematical problems. So, an algorithm can be thought of as a set of directions for solving mathematical computations and problems. This is the algorithm definition that is used throughout mathematics.

What is Q-learning in OpenAI Gym?

It is a model-free approach in reinforcement learning that does not rely on pre-existing knowledge of its environment. Instead, it adopts a learn-by-doing strategy, adjusting its actions based on the outcomes they produce — rewards or penalties.

What is the algorithm of deep Q-learning?

The Deep Q-Learning training algorithm has two phases: Sampling: we perform actions and store the observed experience tuples in a replay memory. Training: Select a small batch of tuples randomly and learn from this batch using a gradient descent update step.

What is the implementation of Q-learning in Python?

Implementation of Q-Learning

In this Q-learning implementation, a grid world environment is defined with 16 states, and agents can take 4 possible actions: up, down, left, and right. The goal is to reach state 15. The Q-table, initialized with zeros, serves as a memory to store Q-values for state-action pairs.

What is explanation-based learning in machine learning?

Explanation-based learning is a type of machine learning that uses a very strong, or even flawless, domain theory to generalise or create ideas from training data. It is also tied to Encoding, which aids in learning.

Which algorithms are like Q-learning?

The most popular reinforcement learning algorithms include Q-learning, SARSA, DDPG, A2C, PPO, DQN, and TRPO. These algorithms have been used to achieve state-of-the-art results in various applications such as game playing, robotics, and decision making.
