Reputation: 47
I'm not sure how to get the Q Values for a DDQN.
DQN is the online (normal) network, TAR is the target network.
q_values = self.DQN.predict(c_states) # DQN batch predict Q on states
dqn_next = self.DQN.predict(n_states) # DQN batch predict Q on next_states
tar_next = self.TAR.predict(n_states) # TAR batch predict Q on next_states
I mainly found 2 versions:
Version 1:
q_values[i][actions[i]] = (rewards[i] + (GAMMA * np.amax(tar_next[i])))
Version 2:
act = np.argmax(dqn_next[i])
q_values[i][actions[i]] = (rewards[i] + (GAMMA * tar_next[i][act]))
Which one is correct? And why?
Version 1 Links:
https://github.com/keon/deep-q-learning/blob/master/ddqn.py
https://pythonprogramming.net/training-deep-q-learning-dqn-reinforcement-learning-python-tutorial
Version 2 Links:
https://github.com/germain-hug/Deep-RL-Keras/blob/master/DDQN/ddqn.py
https://jaromiru.com/2016/11/07/lets-make-a-dqn-double-learning-and-prioritized-experience-replay/
EDIT: Many thanks. To clarify, this is what I took from the answers:
Q-learning:
q_values[i][actions[i]] = (rewards[i] + (GAMMA * np.amax(tar_next[i])))
SARSA:
act = np.argmax(dqn_next[i])
q_values[i][actions[i]] = (rewards[i] + (GAMMA * tar_next[i][act]))
EDIT (re-opened 03/2020):
I'm sorry, but I have to re-open this question. Maybe I misunderstood something, but the following source shows that my Version 2 (labelled SARSA above) is actually Double Q-learning?
Page 158, Double Q-learning: http://incompleteideas.net/book/RLbook2018.pdf
Upvotes: 1
Views: 1174
Reputation: 47
Thanks to your help and the information here (leosimmons, link 1), I found the source of my confusion:
The Bellman equation used there follows the Double DQN form from link 3:
value = reward + discount_factor * target_network.predict(next_state)[argmax(online_network.predict(next_state))]
The Bellman equation in the original (vanilla) DQN (link 2) is:
value = reward + discount_factor * max(target_network.predict(next_state))
The difference is that, using the terminology of the field, the second equation uses the target network for both SELECTING and EVALUATING the action to take, whereas the first equation uses the online network for SELECTING the action and the target network for EVALUATING it. Selection here means choosing which action to take, and evaluation means getting the projected Q value for that action. This form of the Bellman equation is what makes the agent a Double DQN rather than a plain DQN; it was introduced in [3].
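To make the selection/evaluation split concrete, here is a minimal numpy sketch of both targets, using the array names from the question (GAMMA is an assumed example value, and this is an illustration only, not a full training loop):

import numpy as np

GAMMA = 0.99  # assumed example discount factor

# q_values = DQN.predict(c_states), dqn_next = DQN.predict(n_states),
# tar_next = TAR.predict(n_states); actions and rewards are 1-D batch arrays.

def dqn_target(rewards, tar_next, i):
    # Vanilla DQN (link 2): the target network both SELECTS and EVALUATES
    # the next action via the max operator.
    return rewards[i] + GAMMA * np.max(tar_next[i])

def ddqn_target(rewards, dqn_next, tar_next, i):
    # Double DQN (link 3): the online network SELECTS the action (argmax),
    # the target network EVALUATES it.
    act = np.argmax(dqn_next[i])
    return rewards[i] + GAMMA * tar_next[i][act]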
[1] https://medium.com/@leosimmons/double-dqn-implementation-to-solve-openai-gyms-cartpole-v-0-df554cd0614d
[2] https://storage.googleapis.com/deepmind-media/dqn/DQNNaturePaper.pdf
[3] https://arxiv.org/pdf/1509.06461.pdf
Very well explained here: https://youtu.be/ILDLT97FsNM?t=331
Upvotes: 0
Reputation: 5402
This is Q-learning (the version with the max operator) vs SARSA (without the max).
In short, you collect samples using the e-greedy policy: this is your behavior (or exploration) policy. The policy you want to learn is called "target" and can be different.
With Q-learning, you use the max operator, so your target is chosen according to the greedy (target) policy. This is called off-policy learning, because you learn a policy (target) with the samples collected by a different one (behavior).
With SARSA, there is no max, so in practice you just use the action stored in the samples, which was selected by the behavior policy. This is on-policy, because the target and the behavior are the same.
Which one to prefer is up to you, but I think that Q-learning is more common (and DQN uses Q-learning).
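As a minimal sketch of the difference (assuming a hypothetical tabular Q indexed as Q[state, action]; the function names are mine, for illustration only), the two TD targets differ only in how the next action's value is picked:

import numpy as np

def q_learning_target(Q, reward, next_state, gamma):
    # Off-policy: bootstrap with the greedy (max) action value in next_state,
    # regardless of which action the behavior policy actually takes next.
    return reward + gamma * np.max(Q[next_state])

def sarsa_target(Q, reward, next_state, next_action, gamma):
    # On-policy: bootstrap with the value of the action actually selected
    # by the behavior (e-greedy) policy in next_state.
    return reward + gamma * Q[next_state, next_action]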
More reading about this:
What is the difference between Q-learning and SARSA?
Are Q-learning and SARSA with greedy selection equivalent?
http://incompleteideas.net/book/RLbook2018.pdf
EDIT FOR DDQN
SARSA and Q-learning are two separate algorithms.
In DDQN you have two Q estimates and two target policies, so the algorithm is still off-policy (the sampling policy is e-greedy, the target policies are greedy), while SARSA is on-policy (target policy = sampling policy).
The trick in DDQN is that, in the TD target for updating Q1 (the first critic), you select the greedy action according to Q1 but evaluate it with Q2 (the second critic), and vice versa (see the sketch below). There is still a max/argmax in the target, so it's still off-policy. SARSA, instead, is on-policy.
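For reference, a rough tabular sketch of the Double Q-learning update from the Sutton & Barto book cited above (Q1 and Q2 are hypothetical Q tables indexed as Q[state, action]; alpha and gamma are assumed hyperparameters):

import numpy as np

def double_q_update(Q1, Q2, state, action, reward, next_state, alpha=0.1, gamma=0.99):
    # With probability 0.5, update Q1: select the next action with Q1,
    # evaluate it with Q2; otherwise swap the roles of Q1 and Q2.
    if np.random.rand() < 0.5:
        a_star = np.argmax(Q1[next_state])
        target = reward + gamma * Q2[next_state, a_star]
        Q1[state, action] += alpha * (target - Q1[state, action])
    else:
        a_star = np.argmax(Q2[next_state])
        target = reward + gamma * Q1[next_state, a_star]
        Q2[state, action] += alpha * (target - Q2[state, action])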
There are multiple versions of DDQN; some, for instance, use the minimum over Q1 and Q2. Here are some references:
https://arxiv.org/pdf/1509.06461.pdf
https://arxiv.org/pdf/1802.09477.pdf
Upvotes: 2