Me- La Ría

Reputation: 51

DQN performance swinging

I'm using DDQN with experience replay, following this tutorial: https://pytorch.org/tutorials/intermediate/reinforcement_q_learning.html, except that I make the problem a little harder by obscuring x_dot and theta_dot (the cart velocity and the angular velocity of the pole). From the observed states I then calculate the previous x_dot, theta_dot, x_dot_dot and theta_dot_dot, and run the learning process on this state space: (x, prev_x_dot, prev_prev_x_dot_dot, theta, prev_theta_dot, prev_prev_theta_dot_dot).
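For reference, here is a simplified sketch of how such a state can be constructed from the raw x and theta observations, using finite differences over CartPole's integration step (tau = 0.02); the class and function names are just illustrative, not from my actual code:

```python
import numpy as np

DT = 0.02  # CartPole's integration time step (tau)

class DerivativeEstimator:
    """Finite-difference estimates of a scalar's first and second
    derivatives from consecutive samples. The estimates naturally lag
    by one (velocity) and two (acceleration) samples, hence the
    prev_/prev_prev_ naming in the state space."""
    def __init__(self):
        self.prev_val = None
        self.prev_deriv = None

    def update(self, val):
        deriv = 0.0 if self.prev_val is None else (val - self.prev_val) / DT
        second = 0.0 if self.prev_deriv is None else (deriv - self.prev_deriv) / DT
        self.prev_val, self.prev_deriv = val, deriv
        return deriv, second

def build_state(x, theta, x_est, theta_est):
    # -> (x, prev_x_dot, prev_prev_x_dot_dot, theta, prev_theta_dot, prev_prev_theta_dot_dot)
    x_dot, x_ddot = x_est.update(x)
    theta_dot, theta_ddot = theta_est.update(theta)
    return np.array([x, x_dot, x_ddot, theta, theta_dot, theta_ddot],
                    dtype=np.float32)
```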

Anyway, my main issue is that with the DQN algorithm as described in the linked tutorial, learning does not converge. I consider the learning successful if the average length of the last 100 episodes is > 450. During training I may see 50-60 consecutive 500-step episodes, but then the episode length randomly swings and drops to as low as 20. I also want to push the problem harder by starting each episode from an arbitrary initial position within a certain range (for both x and theta), but the results so far have not been promising.
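The randomized starts look roughly like this (a sketch; writing to env.unwrapped.state works for Gymnasium's classic-control environments but is not a stable public API, and the ranges here are just examples):

```python
import numpy as np
import gymnasium as gym

env = gym.make("CartPole-v1")

def reset_random(env, x_range=1.0, theta_range=0.1):
    """Reset, then overwrite CartPole's default initial state
    (uniform in [-0.05, 0.05] for all components) with a wider range."""
    env.reset()
    # CartPole state layout: (x, x_dot, theta, theta_dot)
    env.unwrapped.state = np.array([
        np.random.uniform(-x_range, x_range),          # cart position
        0.0,                                           # cart velocity
        np.random.uniform(-theta_range, theta_range),  # pole angle (rad)
        0.0,                                           # pole angular velocity
    ])
    return np.array(env.unwrapped.state, dtype=np.float32)
```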

Is this normal behaviour for an algorithm like DQN? I understand that, since the policy is computed from previous executions, the loss may have convergence issues, but does that account for such severe swings in performance?

I'm using a network with three nonlinear hidden layers, built from 256x256 linear layers.

Upvotes: -1

Views: 32

Answers (1)

lejlot

Reputation: 66815

In general, DQN has little to no convergence guarantees. For first, exploratory experiments it might be better to start with a standard policy gradient method, which under simple conditions (a small enough learning rate / a big enough batch size) will converge... though potentially very slowly. Q-learning has nice properties due to its ability to learn off-policy etc., but if you are OK with learning directly from experience, policy gradient is a more grounded method (as it has convergence guarantees even with deep networks).
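For illustration, a minimal REINFORCE-style policy gradient for CartPole could look roughly like this (hyperparameters are illustrative, not tuned):

```python
import torch
import torch.nn as nn
import gymnasium as gym

env = gym.make("CartPole-v1")
policy = nn.Sequential(
    nn.Linear(4, 128), nn.ReLU(),
    nn.Linear(128, 2),  # logits over the two actions
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
gamma = 0.99

for episode in range(1000):
    obs, _ = env.reset()
    log_probs, rewards = [], []
    done = False
    while not done:
        logits = policy(torch.as_tensor(obs, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        obs, reward, terminated, truncated, _ = env.step(action.item())
        rewards.append(reward)
        done = terminated or truncated

    # Discounted returns, computed backwards, then normalized for stability.
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns = torch.tensor(list(reversed(returns)))
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)

    # Policy gradient loss: -sum_t log pi(a_t|s_t) * G_t
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```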

And to answer the question more explicitly: yes, DQN can diverge, behave chaotically, etc.; it is not out of the ordinary.

Upvotes: 0
