greece57

Reputation: 421

Q-Learning Table converges to -inf

I tried to solve the OpenAI Gym mountain-car problem with my own Q-learning implementation.

After trying different things it started to work really well, but after a while (20k episodes * 1000 samples per episode) I noticed that the values stored in my Q-table got too big and eventually ended up as -inf.

During the simulation I used the following code:

for t in range(SAMPLE_PER_EPISODE):

    observation, reward, done, info = env.step(action)
    R[state, action] = reward

    history.append((state, action, reward))

    # pick one of the actions with the highest Q-value for this state (ties broken randomly)
    max_indexes = np.argwhere(Q[state, :] == np.amax(Q[state, :])).flatten()
    action = np.random.choice(max_indexes)

For learning I used the following code after each episode:

#train
latest_best = 0
total_reward = 0
for entry in reversed(history):
    # entry = (state, action, reward)
    Q[entry[0],entry[1]] = Q[entry[0],entry[1]] + lr * (entry[2] + latest_best * gamma)

    latest_best = np.max(Q[entry[0],:])
    total_reward += entry[2]

I got really good results with this algorithm, but the problem was - as explained above - that the Q-values went to -inf really quickly.

I think I implemented the Q-learning algorithm incorrectly, but after changing it to the following implementation, it no longer works anywhere near as well as it did before:

#train
latest_best = 0
total_reward = 0
for entry in reversed(history):
    # Here I changed the code
    Q[entry[0],entry[1]] = Q[entry[0],entry[1]] + lr * (entry[2] + latest_best * gamma - Q[entry[0],entry[1]])

    latest_best = np.max(Q[entry[0],:])
    total_reward += entry[2]

What am I doing wrong?

Upvotes: 2

Views: 352

Answers (1)

Sentry

Reputation: 4113

I think there are two problems with your code:

  1. Firstly, your learning rate is probably too high (lr = 0.99 from your comment), and your discount factor (gamma = 0.8) might be too high as well.

The book Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto, two of the founding fathers of reinforcement learning, is available online and I highly recommend you use it as a reference.

Q-Learning is a special case of Temporal Difference Learning and subchapter 6.2 mostly uses learning rates smaller than 0.15.

  2. Assuming that entry[0] is x_k, entry[1] is u_k and entry[2] is r_{k+1}, then this line

     Q[entry[0],entry[1]] = Q[entry[0],entry[1]] + lr * (entry[2] + latest_best * gamma - Q[entry[0],entry[1]])
    

is equivalent to

    Q[x_k, u_k] = Q[x_k, u_k] + lr * (r_{k+1} + latest_best * gamma - Q[x_k, u_k])

If this is supposed to represent the Q-learning update rule

    Q(x_k, u_k) = Q(x_k, u_k) + lr * (r_{k+1} + gamma * max_u Q(x_{k+1}, u) - Q(x_k, u_k))

there is a problem with your first version, because you basically keep summing up rewards that are only slightly discounted. The second version with the additional -Q[x_k, u_k] should be correct; a sketch that combines both points follows below.
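Putting both points together, here is a minimal sketch of how the per-episode training loop could look. It reuses the Q, history, lr and gamma names from your question; lr = 0.1 is only a placeholder suggestion in the spirit of point 1 (not a tuned value), and the backwards replay of the episode is kept exactly as in your code:

    import numpy as np

    lr = 0.1      # placeholder: much smaller than 0.99, closer to the values used in Sutton & Barto ch. 6
    gamma = 0.8   # your current value; you can experiment with lowering it as well

    # train: replay the finished episode backwards, as in your original code
    latest_best = 0.0
    total_reward = 0.0
    for state, action, reward in reversed(history):
        # TD target: r_{k+1} + gamma * max_u Q(x_{k+1}, u)
        td_target = reward + gamma * latest_best

        # the "- Q[state, action]" term turns the update into a bounded moving average
        Q[state, action] += lr * (td_target - Q[state, action])

        latest_best = np.max(Q[state, :])
        total_reward += reward

With the -Q[state, action] term, the new value is a weighted average of the old estimate and the TD target, so as long as the rewards are bounded the Q-table stays bounded instead of drifting towards -inf.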


Upvotes: 6
