greece57

Reputation: 421

Q-Learning Table converges to -inf

I tried to solve the OpenAI Gym mountain-car problem with my own Q-learning implementation.

After trying different things it started to work really well, but after a while (20k episodes * 1000 samples per episode) I noticed that the values stored in my Q-table got too big and eventually ended up as -inf.

During the simulation I used the following code:

for t in range(SAMPLE_PER_EPISODE):

    observation, reward, done, info = env.step(action)
    R[state, action] = reward

    history.append((state, action, reward))

    # pick one of the actions with the highest Q-value for this state (ties broken randomly)
    max_indexes = np.argwhere(Q[state, :] == np.amax(Q[state, :])).flatten()
    action = np.random.choice(max_indexes)

For learning I used the following code after each episode:

#train
latest_best = 0
total_reward = 0
for entry in reversed(history):
    # entry = (state, action, reward)
    Q[entry[0],entry[1]] = Q[entry[0],entry[1]] + lr * (entry[2] + latest_best * gamma)

    latest_best = np.max(Q[entry[0],:])
    total_reward += entry[2]

I got really good results with this algorithm, but the problem was - as explained above - that the Q-values went to -inf really quickly.

I think I implemented the Q-learning algorithm incorrectly, but after changing it to the following implementation, it no longer works anywhere near as well as it did before:

#train
latest_best = 0
total_reward = 0
for entry in reversed(history):
    # Here I changed the code
    Q[entry[0],entry[1]] = Q[entry[0],entry[1]] + lr * (entry[2] + latest_best * gamma - Q[entry[0],entry[1]])

    latest_best = np.max(Q[entry[0],:])
    total_reward += entry[2]

What am I doing wrong?

Upvotes: 2

Views: 352

Answers (1)

Sentry

Reputation: 4113

I think there are two problems with your code:

  1. Firstly, your learning rate is probably too high (lr = 0.99 from your comment), and your discount factor (gamma = 0.8) might be too high as well.

The book Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto, two of the founding fathers of reinforcement learning, is available online and I highly recommend you use it as a reference.

Q-Learning is a special case of Temporal Difference Learning and subchapter 6.2 mostly uses learning rates smaller than 0.15.

  2. Assuming that entry[0] is x_k, entry[1] is u_k and entry[2] is r_{k+1}, then this line

     Q[entry[0],entry[1]] = Q[entry[0],entry[1]] + lr * (entry[2] + latest_best * gamma - Q[entry[0],entry[1]])
    

is equivalent to

    Q[x_k, u_k] = Q[x_k, u_k] + lr * (r_{k+1} + latest_best * gamma - Q[x_k, u_k])

If this is supposed to represent the Q-learning update rule

    Q(x_k, u_k) = Q(x_k, u_k) + lr * (r_{k+1} + gamma * max_u Q(x_{k+1}, u) - Q(x_k, u_k))

there is a problem with your first version, because you basically keep summing up rewards that are only slightly discounted. The second version with the additional -Q[x_k, u_k] should be correct; a sketch that combines both points follows below.
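Putting both points together, here is a minimal sketch of how the per-episode training loop could look. It reuses the Q, history, lr and gamma names from your question; lr = 0.1 is only a placeholder suggestion in the spirit of point 1 (not a tuned value), and the backwards replay of the episode is kept exactly as in your code:

    import numpy as np

    lr = 0.1      # placeholder: much smaller than 0.99, closer to the values used in Sutton & Barto ch. 6
    gamma = 0.8   # your current value; you can experiment with lowering it as well

    # train: replay the finished episode backwards, as in your original code
    latest_best = 0.0
    total_reward = 0.0
    for state, action, reward in reversed(history):
        # TD target: r_{k+1} + gamma * max_u Q(x_{k+1}, u)
        td_target = reward + gamma * latest_best

        # the "- Q[state, action]" term turns the update into a bounded moving average
        Q[state, action] += lr * (td_target - Q[state, action])

        latest_best = np.max(Q[state, :])
        total_reward += reward

With the -Q[state, action] term, the new value is a weighted average of the old estimate and the TD target, so as long as the rewards are bounded the Q-table stays bounded instead of drifting towards -inf.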


Upvotes: 6
