Reputation: 421
I tried to solve the OpenAI Gym MountainCar problem with my own Q-learning implementation.
After trying out different things it started to work really well, but after a while (20k episodes * 1000 samples per episode) I noticed that the values stored in my Q-table grew too large in magnitude, so the table ended up storing -inf.
During the simulation I used the following code:
for t in range(SAMPLE_PER_EPISODE):
    # step the environment with the greedily chosen action
    observation, reward, done, info = env.step(action)
    R[state, action] = reward
    history.append((state,action,reward))
    # choose the next action greedily, breaking ties between equal Q-values at random
    max_indexes = np.argwhere(Q[state,] == np.amax(Q[state,])).flatten()
    action = np.random.choice(max_indexes)
For learning I used the following code after each episode:
#train
latest_best = 0
total_reward = 0
# each history entry is (state, action, reward)
for entry in reversed(history):
    Q[entry[0],entry[1]] = Q[entry[0],entry[1]] + lr * (entry[2] + latest_best * gamma)
    latest_best = np.max(Q[entry[0],:])
    total_reward += entry[2]
I got really good results with that algorithm, but the problem was, as explained above, that the Q-values very quickly ran off to -inf.
I think I implemented the Q-learning algorithm wrong, but after changing it to the following implementation it no longer works nearly as well as it did before:
#train
latest_best = 0
total_reward = 0
for entry in reversed(history):
    # Here I changed the code
    Q[entry[0],entry[1]] = Q[entry[0],entry[1]] + lr * (entry[2] + latest_best * gamma - Q[entry[0],entry[1]])
    latest_best = np.max(Q[entry[0],:])
    total_reward += entry[2]
What am I doing wrong?
Upvotes: 2
Views: 352
Reputation: 4113
I think there are two problems with your code:
First, your learning rate (lr) is probably too high, and your discount factor (gamma = 0.8) might be, too. The book Reinforcement Learning: An Introduction by Richard S. Sutton, one of the founding fathers of reinforcement learning, is available online and I highly recommend you use it as a reference. Q-Learning is a special case of Temporal Difference Learning, and subchapter 6.2 mostly uses learning rates smaller than 0.15.
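As a rough illustration (these exact numbers are my assumption, not values taken from your code), hyperparameters in that spirit could look like this:

# illustrative values only; the exact settings need tuning
lr = 0.1     # step size below the ~0.15 range used in the book's TD examples
gamma = 0.8  # your current discount factor; worth experimenting with as well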
Second, assuming that entry[0] is x_k, entry[1] is u_k and entry[2] is r_{k+1}, this line
Q[entry[0],entry[1]] = Q[entry[0],entry[1]] + lr * (entry[2] + latest_best * gamma - Q[entry[0],entry[1]])
is equivalent to
Q[x_k, u_k] = Q[x_k, u_k] + lr * (r_{k+1} + latest_best * gamma - Q[x_k, u_k])
If this is supposed to represent the standard Q-learning update

Q[x_k, u_k] = Q[x_k, u_k] + lr * (r_{k+1} + gamma * max_u Q[x_{k+1}, u] - Q[x_k, u_k])

then there is a problem with your first version: without the -Q[x_k, u_k] term you basically keep summing up rewards that are only slightly discounted, which is exactly why the table runs off to -inf. The second version with the additional -Q[x_k, u_k] should be correct.
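To see the difference, here is a minimal, self-contained sketch (a made-up single-state example with reward -1 everywhere, not your MountainCar setup; lr and gamma are assumed values chosen just for this illustration) that applies both update rules repeatedly:

# toy example: one state whose only transition gives reward -1
lr, gamma = 0.1, 0.8
reward = -1.0

q_v1 = 0.0  # first version: update without the "- Q" term
q_v2 = 0.0  # second version: standard Q-learning update

for _ in range(20000):
    # with a single state, max_u Q[next_state, u] is just the current estimate
    q_v1 = q_v1 + lr * (reward + gamma * q_v1)
    q_v2 = q_v2 + lr * (reward + gamma * q_v2 - q_v2)

print(q_v1)  # overflows and prints -inf, just like your Q-table
print(q_v2)  # settles near reward / (1 - gamma) = -5.0

The first rule has no fixed point for a constant negative reward, so the estimate shrinks geometrically towards -inf; the second rule contracts towards r / (1 - gamma) and stays bounded.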
Upvotes: 6