Deep Q Learning

In this post we aim to provide the relevant mathematical and programmatic background for implementing a Deep Q Learning algorithm. We discuss what Q learning is and how the principles from classical Q learning can be used to construct deep Q learning algorithms in Python. We will be borrowing heavily from Brunton and Kutz’s book titled “Data-Driven Science and Engineering: Machine Learning, Dynamical Systems, and Control”.

General Reinforcement Learning

Before we begin our discussion of Q learning, we start with an exposition of the two primary categories of RL: model-based and model-free. Model-based RL systems build a model of the environment, which includes the dynamics of how actions lead to subsequent states and rewards. Essentially, the model predicts the future states and rewards for actions taken from a given state. Take for example the gambler’s problem, where the environment supplies an explicit probability measure \(P(s'|s,a)\) describing whether the actor will win or lose a gamble. In other words, the gambler can query the environment for its transition dynamics. The second approach is model-free, where a system learns a policy or value function directly from interactions with the environment without constructing an explicit model of the environment’s dynamics. The system learns what to do by trial and error, adjusting its actions based on the rewards received.

The second distinction that we’d like to note is between two subcategories of model-free RL: off-policy and on-policy learning. On-policy algorithms learn the value of the policy that is currently being used to make decisions. Put simply, the policy used to generate behavior (exploration) is the same policy that is being evaluated and improved. Essentially, the agent learns on the job, using the data generated by its current strategy. By contrast, off-policy algorithms can learn about one policy (the target policy) while using a different policy (the behavior policy) to generate behavior. This means they can learn from data generated by a previous policy or even by other agents. This separation allows for greater flexibility and efficiency in some contexts.

Q-Learning

With the general background of RL out of the way, we note that Q learning is a model-free, off-policy training scheme. That means Q learning does not rely on an environment that can produce exact transition probabilities; instead it must use trial and error over optimal or sub-optimal actions in order to learn the optimal policy. To formalize this type of training scheme we first define the quality function \(Q(s,a)\), which tells you the value, or quality, of taking an action \(a\) given a current state \(s\).

To formally define \(Q(s,a)\) we first introduce some notation. We let \(R(s',s,a)\) be the reward of transitioning from state \(s\) to \(s'\) through action \(a\), \(\gamma\) be the discount factor of future reward, and \(V(s')\) be the expected value over all possible future rewards given a current state \(s'\). With that we define \(Q(s,a)\) as

\[Q(s,a) = \mathbb{E}\left[R(s',s,a) + \gamma V(s')\right] = \sum_{s'}P(s'|s,a)\left(R(s',s,a) + \gamma V(s')\right)\]

In other words, \(Q\) is the expected sum of the instantaneous reward of the state-action pair \((s,a)\) and the discounted future reward of being in the new state \(s'\) brought on by \((s,a)\). Using the quality function \(Q(s,a)\) we can define an optimal policy \(\pi\) and value function \(V(s)\), which tell us which action \(a\) is optimal and what the expected reward of taking that action is.

\[V(s) = \max\limits_{a}Q(s,a)\] \[\pi(s) = \arg \max\limits_{a}Q(s,a)\]
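
For a small, discrete problem where \(Q(s,a)\) is stored as a table, these two definitions reduce to a row-wise max and argmax. Below is a minimal sketch, assuming a hypothetical three-state, two-action \(Q\) table filled with arbitrary numbers.

```python
import numpy as np

# Hypothetical Q table: one row per state, one column per action.
Q = np.array([[1.0, 2.5],
              [0.3, 0.1],
              [4.0, 4.2]])

V = Q.max(axis=1)      # V(s)  = max_a Q(s, a)
pi = Q.argmax(axis=1)  # pi(s) = argmax_a Q(s, a)

print(V)   # [2.5 0.3 4.2]
print(pi)  # [1 0 1]
```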

We can even rewrite the equations above in terms of Bellman’s equation, which tells us that the value of being in state \(s\) is the expected current and future return of applying the optimal action \(a\).

\[V(s) = \max\limits_{\pi}\mathbb{E}\left[R(s',s,a) + \gamma V(s')\right]\]

We’ve defined what a \(q\) value is and how to construct the quality function \(Q(s,a)\); now we define a recursive equation for updating \(Q(s,a)\) as the agent learns through trial and error.

\[Q^{\text{new}}(s_k, a_k) = Q^{\text{old}}(s_k, a_k) + \alpha \left( \underbrace{r_k + \gamma \max\limits_{a}Q(s_{k+1},a) - Q^{\text{old}}(s_k,a_k)}_{\text{TD Error}} \right)\]

Let’s dissect this update equation. As we engage in trial and error learning, we update \(Q(s,a)\) by slowly nudging our \(q\) values up or down by a factor of the difference between the actual current and future reward, \(r_k + \gamma \max\limits_{a}Q(s_{k+1},a)\), and our predicted reward, \(Q^{\text{old}}(s_k,a_k)\). This difference is referred to as the temporal difference (TD) error. We also note that \(\alpha\) in this case is the learning rate.

We draw the reader’s attention to two interesting features of the update equation. First, consider the term \(\gamma \max\limits_{a}Q(s_{k+1},a)\). We maximize \(Q\) over \(a\) to guarantee that the quality \(Q(s_k,a_k)\) accounts for the best action available in the new state \(s_{k+1}\) brought on by \((s_k,a_k)\). The reward \(r_k\), on the other hand, which comes from the action \(a_k\) actually taken in state \(s_k\), is not required to be the optimal reward and will most likely be sub-optimal during the training process. This is precisely why Q Learning is an off-policy learning scheme.

Second, we note that when calculating the TD error, the agent estimates the future reward by looking one time step ahead. This idea of looking forward in time to compute the error is why it is called the temporal difference error. Naturally, the number of steps the agent looks ahead can be modified.
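
To make the update rule concrete, here is a minimal sketch of tabular Q learning with an \(\epsilon\)-greedy behavior policy. The environment interface (`env.reset()` and `env.step(a)` returning the next state, reward, and a done flag) is an assumed Gym-style convention, not something specified above.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q learning with an epsilon-greedy behavior policy."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Behavior policy: explore with probability epsilon, otherwise act greedily.
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = np.argmax(Q[s])
            s_next, r, done = env.step(a)
            # TD error: observed reward plus discounted best future value,
            # minus the current estimate Q(s, a).
            td_error = r + gamma * np.max(Q[s_next]) - Q[s, a]
            Q[s, a] += alpha * td_error
            s = s_next
    return Q
```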

Deep Q Learning

Now we discuss the process of recasting Q Learning into training schemes that incorporate deep learning methods like neural networks. This is especially useful in settings where the state space is too large to reasonably store all \(q\) values in a \(Q\) table. Here, a deep learning approach seeks to parameterize \(Q(s,a)\) by some weights \(\theta\) such that

\[Q(s,a) \approx Q(s,a,\theta)\]

In order to optimize the parameters \(\theta\) within a Q Learning framework, we use a loss function that is eerily similar to the TD error defined above. This framework is known as a Vanilla Deep Q Network (V-DQN).

\[\mathcal{L} = \mathbb{E}[(r_k + \gamma \max\limits_{a}Q(s_{k+1},a,\theta) - Q(s_k,a_k,\theta))^2]\]
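
As an illustration, this loss can be computed directly from a batch of transitions \((s_k, a_k, r_k, s_{k+1})\). The sketch below assumes PyTorch and a `q_net` that maps a batch of states to one \(Q\) value per action; the terminal-state masking via `dones` is a standard practical detail not shown in the equation above.

```python
import torch
import torch.nn.functional as F

def vdqn_loss(q_net, states, actions, rewards, next_states, dones, gamma=0.99):
    """Mean squared TD error for a vanilla DQN (one network, theta)."""
    # Q(s_k, a_k, theta): Q value of the action actually taken (actions is a LongTensor).
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    # max_a Q(s_{k+1}, a, theta): greedy estimate of the next state's value.
    next_q = q_net(next_states).max(dim=1).values
    # dones masks out the bootstrap term for terminal transitions.
    target = rewards + gamma * next_q * (1 - dones)
    # The target is treated as fixed data, so it is detached from the graph.
    return F.mse_loss(q_sa, target.detach())
```

Detaching the target reflects the convention of treating the TD target as a fixed label when differentiating the loss with respect to \(\theta\).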

With the loss function properly defined, we can move on to the neural network architecture and how it allows us to approximate the \(Q\) function. In practice, the architecture of the network varies based on the use case. For example, a team of DeepMind engineers in 2013 coupled the loss function described above with several convolutional layers followed by fully connected layers. The inputs were several consecutive frames from the game, which represented the agent’s state, and the output layer had one node for each possible action the agent could take. Regardless of the architecture used, the value contained in each output node approximates \(Q(s,a)\) for the associated \((s,a)\) pair, which corresponds to the network’s (input, output) pair.
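
As a rough sketch in PyTorch (layer sizes chosen for brevity rather than fidelity to the published architecture), such a network might look like the following.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a stack of consecutive game frames to one Q value per action."""
    def __init__(self, n_frames=4, n_actions=6):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(n_frames, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(256), nn.ReLU(),  # infers the flattened conv size
            nn.Linear(256, n_actions),      # one output node per possible action
        )

    def forward(self, x):
        # Each output approximates Q(s, a) for the corresponding action a.
        return self.head(self.conv(x))

q_net = QNetwork()
q_values = q_net(torch.zeros(1, 4, 84, 84))  # a batch of one 84x84 frame stack
```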

The deep Q networks described above can be difficult to train due to large variance in the estimated \(Q\) function. To mitigate this, we can introduce two separate stabilization schemes.

Stabilization Schemes

First we introduce the fixed target distribution Q learning framework, known as T-DQN. This framework introduces a second neural network to approximate the temporal difference target, instead of using the main network itself to predict the quality of the next state. Under this framework the TD error and TD target are defined as follows

\[\underbrace{\overbrace{r_k + \gamma \max\limits_{a}Q(s_{k+1},a;\theta^-)}^{\text{TD Target}} - Q^{\text{old}}(s_k,a_k;\theta)}_{\text{TD Error}}\]

where \(Q(s', a'; \theta^-)\) is the Q value predicted by the target network for the next state-action pair, and \(Q(s, a; \theta)\) is the Q value predicted by the main Q network for the current state-action pair.

\[TD_{\text{target}}^{\text{T-DQN}} = r + \gamma \max_{a'} Q(s', a'; \theta^-)\]
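
A minimal sketch of this scheme, reusing the `q_net` from the architecture example above: the target network \(\theta^-\) starts as a copy of the main network, is used only to compute the TD target, and is synchronized with the main network every `sync_every` steps (a hypothetical hyperparameter).

```python
import copy
import torch.nn.functional as F

target_net = copy.deepcopy(q_net)   # theta^- starts as a copy of theta
target_net.requires_grad_(False)    # the target network is never trained directly

def tdqn_loss(states, actions, rewards, next_states, dones, gamma=0.99):
    # Q(s_k, a_k; theta) from the main network.
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    # TD target uses the frozen parameters theta^-.
    next_q = target_net(next_states).max(dim=1).values
    target = rewards + gamma * next_q * (1 - dones)
    return F.mse_loss(q_sa, target)

# Inside the training loop, sync theta^- <- theta only every sync_every steps:
# if step % sync_every == 0:
#     target_net.load_state_dict(q_net.state_dict())
```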

Unlike the main network, the target network is not updated at every training step, which would lead to high variance in the target Q values and general instability in training; instead, its weights are only updated after a fixed number of steps or episodes. This creates a stable target distribution for the Q values against which the main network is updated. However, given this training scheme, the target network is functionally just a delayed copy of the main network. Because the target and actual \(Q\) values are estimated by similar networks, we run the risk of overestimating the target when our data has high variance. To mitigate this overestimation problem, we can propose another framework called the Double Deep Q Network (D-DQN).

Like fixed target distribution Q learning, the double deep Q learning framework relies on a target network and a main network. However, in this case the main Q network is responsible for choosing the best action, while the target network is used to evaluate the quality of that action. It’s a subtle difference that can be formulated as follows

\[TD_{\text{target}}^{\text{DDQN}} = r + \gamma Q(s', \underset{a'}{\mathrm{argmax}} Q(s', a'; \theta); \theta^-)\]
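
In code, the only change from the fixed-target scheme is how the bootstrap term is computed: the main network selects the next action and the target network evaluates it. A sketch of just the target computation, again reusing the hypothetical `q_net` and `target_net` from above:

```python
def ddqn_target(rewards, next_states, dones, gamma=0.99):
    # The main network (theta) selects the best next action ...
    best_actions = q_net(next_states).argmax(dim=1, keepdim=True)
    # ... and the target network (theta^-) evaluates that action.
    next_q = target_net(next_states).gather(1, best_actions).squeeze(1)
    return rewards + gamma * next_q * (1 - dones)
```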