Reputation: 53
I am trying to implement DQN and DDQN(both with experience reply) to solve OpenAI AI-Gym Cartpole Environment. Both of the approaches are able to learn and solve this problem sometimes, but not always.
My network is simply a feed forward network(I've tried using 1 and 2 hidden layers). In DDQN I created one network in DQN, and two networks in DDQN, a target network to evaluate the Q value and a primary network to choose the best action, train the primary network, and copy it to target network after some episodes.
The problem in DQN is:
The problem in DDQN is:
I've tried tuning batch size, learning rate, number of neurons in the hidden layer, the number of hidden layers, exploration rate, but instability persists.
Are there any rule of thumb on the size of network and batch size? I think reasonably larger network and larger batch size will increase stability.
Is it possible to make the learning stable? Any comments or references are appreciated!
Upvotes: 4
Views: 6297
Reputation: 1
I spent a whole day solving this problem. The return climbs to above 400, and suddenly falls to 9.x. In my case I think it's due to the unstable gradients. The l2 norm of the gradients varies from 1 or 2 to several thousands.
Finally solved it. See whether it could help.
clip the gradients before apply them, use a learning rate decay schedule
variables = model.trainable_variables
grads = tape.gradient(loss, variables)
grads, grads_norm = tf.clip_by_global_norm(grads, 30.0)
learning_rate = 0.1 / (math.sqrt(total_steps) + 1)
for g, var in zip(grads, variables):
var.assign_sub(g * learning_rate)
use a exploration rate decay schedule
epsilon = 0.85 ** math.log(total_steps + 1, 2)
Upvotes: 0
Reputation: 47
I also was thinking that the problem is the unstable (D)DQN or that "CartPole" is bugged or "not stable solvable"!
After searching for a few weeks, I have checked my code several times, changed every setting but one...
The discount factor, setting it to 1.0 (really) has stabilized my training much more on CartPole-v1 at 500 max steps.
CartPole-v1 was stable in training with a simple Q-Learner (reduce min-alpha and min-epsilon to 0.001): https://github.com/sanjitjain2/q-learning-for-cartpole/blob/master/qlearning.py
The creator has Gamma at 1.0 (I read about it on reddit), so I tested it with a simple DQN (double_q = False) from here: https://github.com/adventuresinML/adventures-in-ml-code/blob/master/double_q_tensorflow2.py
I also removed 1 line: # reward = np.random.normal(1.0, RANDOM_REWARD_STD)
This way it gets the normal +1 reward per step and was "stable" 7 out of 10 runs.
And here is the result:
Upvotes: 3
Reputation: 223
These kind of problems happen pretty often and you shouldn't give up. First, of course, you should do another one or two checks if the code is all right - try to compare your code to other implementations, see how the loss function behave etc. If you are pretty sure your code is all fine - and, as you say that model can learn the task from time to time, it probably is - you should start experimenting with the hyper-parameters.
Your problems seem to be connected to hyper-parameters like exploration technique, learning rate, the way you are updating the target networks and to the experience replay memory. I would not play around with the hidden layer sizes - find the values for which the model learned once and keep them fixed.
Upvotes: 9
Reputation: 158
Maybe this can help you with your problem on this environment.
Cartpole problem with DQN algorithm from Udacity
Upvotes: 2