Jack

Reputation: 53

Why is my Deep Q Net and Double Deep Q Net unstable?

I am trying to implement DQN and DDQN (both with experience replay) to solve the OpenAI Gym CartPole environment. Both approaches are able to learn and solve this problem sometimes, but not always.

My network is a simple feed-forward network (I've tried 1 and 2 hidden layers). In DQN I use a single network; in DDQN I use two: a target network to evaluate the Q values and a primary network to choose the best action. I train the primary network and copy its weights to the target network after some episodes.
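For reference, the target computation I describe above, as a minimal sketch with placeholder names (not my exact code):

    import numpy as np

    def dqn_target(reward, next_q_target, done, gamma=0.99):
        # DQN: the target network both selects and evaluates the best next action
        return reward + (1.0 - done) * gamma * np.max(next_q_target)

    def ddqn_target(reward, next_q_primary, next_q_target, done, gamma=0.99):
        # DDQN: the primary network selects the action, the target network evaluates it
        best_action = np.argmax(next_q_primary)
        return reward + (1.0 - done) * gamma * next_q_target[best_action]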

The problem in DQN is: (training reward plot omitted)

The problem in DDQN is: (training reward plot omitted)

I've tried tuning the batch size, learning rate, number of neurons in the hidden layer, number of hidden layers, and exploration rate, but the instability persists.

Are there any rules of thumb for the size of the network and the batch size? I think a reasonably larger network and a larger batch size would increase stability.

Is it possible to make the learning stable? Any comments or references are appreciated!

Upvotes: 4

Views: 6297

Answers (4)

鲁达鲁提辖

Reputation: 1

I spent a whole day solving this problem. The return climbs above 400 and then suddenly falls to around 9. In my case I think it was due to unstable gradients: the L2 norm of the gradients varied from 1 or 2 up to several thousand.

I finally solved it. See whether this helps.

  1. Clip the gradients before applying them, and use a learning-rate decay schedule:

     import math
     import tensorflow as tf

     # inside the training step, with `tape` a tf.GradientTape that recorded the loss
     variables = model.trainable_variables
     grads = tape.gradient(loss, variables)
     grads, grads_norm = tf.clip_by_global_norm(grads, 30.0)  # clip the global L2 norm
     learning_rate = 0.1 / (math.sqrt(total_steps) + 1)       # decay with training steps
     for g, var in zip(grads, variables):
         var.assign_sub(g * learning_rate)
    
  2. Use an exploration-rate decay schedule:

     # epsilon decays towards zero as total_steps grows
     epsilon = 0.85 ** math.log(total_steps + 1, 2)
    

Upvotes: 0

anna12345

Reputation: 47

I was also starting to think that the problem was an unstable (D)DQN, or that "CartPole" was bugged or "not stably solvable"!

After searching for a few weeks, I had checked my code several times and changed every setting but one...

  1. The discount factor: setting it to 1.0 (really) made my training on CartPole-v1 (500 max steps) much more stable.

  2. CartPole-v1 was stable in training with a simple Q-Learner (reduce min-alpha and min-epsilon to 0.001): https://github.com/sanjitjain2/q-learning-for-cartpole/blob/master/qlearning.py

  3. The creator has gamma set to 1.0 (I read about it on Reddit), so I tested it with a simple DQN (double_q = False) from here: https://github.com/adventuresinML/adventures-in-ml-code/blob/master/double_q_tensorflow2.py

I also removed 1 line: # reward = np.random.normal(1.0, RANDOM_REWARD_STD)

This way it gets the normal +1 reward per step and was "stable" in 7 out of 10 runs.
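Putting the two changes together, the target computation roughly becomes (just a sketch with placeholder variables, not the exact code from the linked repo):

    import numpy as np

    GAMMA = 1.0                   # discount factor set to 1.0, as described above
    reward = 1.0                  # the plain +1 per step; the np.random.normal(...) line is gone
    next_q_values = np.zeros(2)   # placeholder for the target network's output at the next state
    done = False
    target = reward + (1.0 - float(done)) * GAMMA * np.max(next_q_values)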

And here is the result: (reward plot "ddqn-rewards-cartpole-v1" omitted)

Upvotes: 3

Filip O.

Reputation: 223

These kinds of problems happen pretty often and you shouldn't give up. First, of course, you should do another check or two that the code is all right - try to compare your code to other implementations, see how the loss function behaves, etc. If you are pretty sure your code is fine - and, since you say the model can learn the task from time to time, it probably is - you should start experimenting with the hyper-parameters.

Your problems seem to be connected to hyper-parameters like the exploration technique, the learning rate, the way you update the target network, and the experience replay memory. I would not play around with the hidden layer sizes - find values for which the model learned once and keep them fixed.

  • Exploration technique: I assume you use an epsilon-greedy strategy. My advice would be to start with a high epsilon value (I usually start with 1.0) and decay it after each step or episode, but define an epsilon_min too (see the first sketch after this list). Starting with a low epsilon value may be the cause of the differing learning speeds and success rates - if you go fully random, you always populate your memory with a similar kind of transitions at the beginning. With a lower epsilon at the start, there is a bigger chance that your model does not explore enough before the exploitation phase begins.
  • Learning rate: Make sure it is not too big. A smaller rate may lower the learning speed, but it helps a learned model not to escape from the global minimum back to some local, worse one. Also, adaptive learning rates such as those calculated by Adam might help you. Of course the batch size has an impact as well, but I would keep it fixed and worry about it only if the other hyper-parameter changes don't work.
  • Target network update (rate and value): This is an important one as well. You have to experiment a bit - not only with how often you perform the update, but also with how much of the primary values you copy into the target ones. People often do a hard update every episode or so, but try doing soft updates instead if the first technique does not work (see the soft-update sketch after this list).
  • Experience replay: Do you use it? You should. How big is your memory? This is a very important factor and the memory size can influence stability and the success rate (A Deeper Look at Experience Replay). Basically, if you notice instability in your algorithm, try a bigger memory size, and if it affects your learning curve a lot, try the technique proposed in the mentioned paper (a minimal replay-memory sketch follows below).
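As a rough illustration of the epsilon-greedy decay with a floor (the numbers are only an example, not tuned for your setup):

    import random

    def epsilon_greedy(q_values, epsilon):
        # explore with probability epsilon, otherwise act greedily on the Q-values
        if random.random() < epsilon:
            return random.randrange(len(q_values))
        return max(range(len(q_values)), key=lambda a: q_values[a])

    # start high and decay towards a floor after each episode (or step)
    epsilon, epsilon_min, decay = 1.0, 0.01, 0.995
    for episode in range(500):
        # ... run the episode, calling epsilon_greedy(q_values, epsilon) for each action ...
        epsilon = max(epsilon_min, epsilon * decay)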
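For the target network, a soft (Polyak) update can look roughly like this - a minimal sketch assuming two Keras models with identical architecture:

    import tensorflow as tf

    def soft_update(primary, target, tau=0.01):
        # move the target weights a small step towards the primary weights
        target.set_weights([tau * wp + (1.0 - tau) * wt
                            for wp, wt in zip(primary.get_weights(), target.get_weights())])

    # usage sketch - layer sizes and tau are only examples
    primary = tf.keras.Sequential([tf.keras.layers.Dense(24, activation="relu", input_shape=(4,)),
                                   tf.keras.layers.Dense(2)])
    target = tf.keras.models.clone_model(primary)
    target.set_weights(primary.get_weights())   # start from a hard copy
    soft_update(primary, target, tau=0.01)      # then call this every training step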
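And a minimal sketch of an experience replay memory (the capacity of 100000 is only an example):

    import random
    from collections import deque

    memory = deque(maxlen=100000)   # older transitions fall out automatically

    def remember(state, action, reward, next_state, done):
        memory.append((state, action, reward, next_state, done))

    def sample_batch(batch_size=64):
        # uniform random sampling breaks the correlation between consecutive transitions
        batch = random.sample(memory, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones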

Upvotes: 9

Maybe this can help you with your problem in this environment:

Cartpole problem with DQN algorithm from Udacity

Upvotes: 2
