anna12345

Reputation: 47

How to get Q Values in RL - DDQN

I'm not sure how to get the Q Values for a DDQN.

DQN is the online (normal) network, TAR is the target network.

    q_values = self.DQN.predict(c_states) # DQN batch predict Q on states
    dqn_next = self.DQN.predict(n_states) # DQN batch predict Q on next_states
    tar_next = self.TAR.predict(n_states) # TAR batch predict Q on next_states

I have mainly found two versions:

Version 1:

    q_values[i][actions[i]] = (rewards[i] + (GAMMA * np.amax(tar_next[i])))

Version 2:

    act = np.argmax(dqn_next[i])
    q_values[i][actions[i]] = (rewards[i] + (GAMMA * tar_next[i][act]))

Which one is correct? And why?

Version 1 Links:

https://github.com/keon/deep-q-learning/blob/master/ddqn.py

https://pythonprogramming.net/training-deep-q-learning-dqn-reinforcement-learning-python-tutorial

Version 2 Links:

https://github.com/germain-hug/Deep-RL-Keras/blob/master/DDQN/ddqn.py

https://github.com/rlcode/reinforcement-learning/blob/master/2-cartpole/2-double-dqn/cartpole_ddqn.py

https://jaromiru.com/2016/11/07/lets-make-a-dqn-double-learning-and-prioritized-experience-replay/


EDIT: Many thanks! To clarify:

Q-learning:

    q_values[i][actions[i]] = (rewards[i] + (GAMMA * np.amax(tar_next[i])))

SARSA:

    act = np.argmax(dqn_next[i])
    q_values[i][actions[i]] = (rewards[i] + (GAMMA * tar_next[i][act]))

EDIT: re-open 03/2020

I'm sorry, but I have to re-open this question. Maybe I misunderstood something, but the following sources show that my Version 2 (labelled SARSA above) is actually Double Q-learning?

Sutton & Barto, Reinforcement Learning: An Introduction (2nd ed.), page 158, Double Q-learning: http://incompleteideas.net/book/RLbook2018.pdf

adventuresinML

adventuresinML source

Upvotes: 1

Views: 1174

Answers (2)

anna12345

Reputation: 47

Thanks to your help and the information from leosimmons [1], I found the source of my confusion:

The Bellman equation used here (Double DQN, [3]) is:

    value = reward + discount_factor * target_network.predict(next_state)[argmax(online_network.predict(next_state))]

The Bellman equation in the original (vanilla) DQN [2] is:

    value = reward + discount_factor * max(target_network.predict(next_state))

Quoting leosimmons [1]:

The difference is that, using the terminology of the field, the second equation uses the target network for both SELECTING and EVALUATING the action to take whereas the first equation uses the online network for SELECTING the action to take and the target network for EVALUATING the action. Selection here means choosing which action to take, and evaluation means getting the projected Q value for that action. This form of the Bellman equation is what makes this agent a Double DQN and not just a DQN and was introduced in [3].

[1] https://medium.com/@leosimmons/double-dqn-implementation-to-solve-openai-gyms-cartpole-v-0-df554cd0614d

[2] https://storage.googleapis.com/deepmind-media/dqn/DQNNaturePaper.pdf

[3] https://arxiv.org/pdf/1509.06461.pdf

Very well explained here: https://youtu.be/ILDLT97FsNM?t=331
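To make the difference between the two equations concrete, here is a minimal NumPy sketch (mine, with dummy data standing in for the `DQN.predict` / `TAR.predict` outputs from the question) of the two targets computed batch-wise:

    import numpy as np

    GAMMA = 0.99
    batch_size, n_actions = 4, 2

    # Dummy batch data; in the question these come from the online and
    # target networks evaluated on next_states, shape (batch, n_actions).
    rng = np.random.default_rng(0)
    dqn_next = rng.random((batch_size, n_actions))  # online net on next_states
    tar_next = rng.random((batch_size, n_actions))  # target net on next_states
    rewards = rng.random(batch_size)

    # Vanilla DQN target: the target network both SELECTS and EVALUATES
    # the next action (max over the target network's Q values).
    dqn_target = rewards + GAMMA * np.max(tar_next, axis=1)

    # Double DQN target: the online network SELECTS the action (argmax),
    # the target network EVALUATES it.
    sel = np.argmax(dqn_next, axis=1)
    ddqn_target = rewards + GAMMA * tar_next[np.arange(batch_size), sel]

    print(dqn_target)
    print(ddqn_target)

In both cases the per-sample training target is written into `q_values[i][actions[i]]`, exactly as in the question's two versions.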

Upvotes: 0

Simon

Reputation: 5402

This is Q-learning (the version with the max operator) vs SARSA (without the max).

In short, you collect samples using the e-greedy policy: this is your behavior (or exploration) policy. The policy you want to learn is called "target" and can be different.
With Q-learning, you use the max operator, so your target is chosen according to the greedy (target) policy. This is called off-policy learning, because you learn a policy (target) with the samples collected by a different one (behavior).
With SARSA, there is no max, so in practice you just use the action from the samples, which was selected by the behavior policy. This is on-policy, because the target and the behavior policies are the same.

Which one to prefer is up to you, but I think that Q-learning is more common (and DQN uses Q-learning).
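As a rough illustration (my own sketch, not code from the linked sources), the two tabular TD targets differ only in which next action is evaluated; `Q` is a plain dict of (state, action) -> value:

    def q_learning_target(Q, reward, next_state, actions, gamma):
        # Off-policy: evaluate the greedy action in next_state (the max).
        return reward + gamma * max(Q[(next_state, a)] for a in actions)

    def sarsa_target(Q, reward, next_state, next_action, gamma):
        # On-policy: evaluate the action the behavior policy actually took.
        return reward + gamma * Q[(next_state, next_action)]

    actions = [0, 1]
    Q = {("s1", a): v for a, v in zip(actions, [1.0, 2.0])}
    print(q_learning_target(Q, reward=0.5, next_state="s1", actions=actions, gamma=0.9))  # 2.3
    print(sarsa_target(Q, reward=0.5, next_state="s1", next_action=0, gamma=0.9))         # 1.4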

More reading about this:

What is the difference between Q-learning and SARSA?

Are Q-learning and SARSA with greedy selection equivalent?

https://stats.stackexchange.com/questions/184657/what-is-the-difference-between-off-policy-and-on-policy-learning

http://incompleteideas.net/book/RLbook2018.pdf

EDIT FOR DDQN

SARSA and Q-learning are two separate algorithms.
In DDQN you have two target Q functions and two target policies, so the algorithm is still off-policy (the sampling policy is e-greedy, the target policies are greedy), while SARSA is on-policy (target policy = sampling policy).
The trick in DDQN is that you use the max operator over Q2 (the second critic) in the TD target for updating Q1 (the first critic), and vice versa. But the max is still there, so it's still off-policy. SARSA, instead, is on-policy. A tabular sketch of this double update is shown after the references below.

There are multiple versions of DDQN; some use the minimum over Q1 and Q2, for instance. Here are some references:

https://arxiv.org/pdf/1509.06461.pdf

https://arxiv.org/pdf/1802.09477.pdf
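For completeness, here is a minimal tabular sketch of the double Q-learning update discussed in this thread (following Sutton & Barto, page 158; the function and variable names are mine, and Q1/Q2 are plain dicts of (state, action) -> value):

    import random

    def double_q_update(Q1, Q2, state, action, reward, next_state, actions,
                        alpha=0.1, gamma=0.99):
        # With probability 0.5, update Q1 using Q2 to evaluate Q1's greedy
        # next action; otherwise do the symmetric update of Q2.
        if random.random() < 0.5:
            a_star = max(actions, key=lambda a: Q1[(next_state, a)])
            td_target = reward + gamma * Q2[(next_state, a_star)]
            Q1[(state, action)] += alpha * (td_target - Q1[(state, action)])
        else:
            a_star = max(actions, key=lambda a: Q2[(next_state, a)])
            td_target = reward + gamma * Q1[(next_state, a_star)]
            Q2[(state, action)] += alpha * (td_target - Q2[(state, action)])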

Upvotes: 2
