Reputation: 1431
I am pretty new to machine learning methods and thought I would give Q-learning a try. I have been reading this article:
http://mnemstudio.org/path-finding-q-learning-tutorial.htm
What confuses me is this equation:
Q(1, 5) = R(1, 5) + 0.8 * Max[Q(5, 1), Q(5, 4), Q(5, 5)] = 100 + 0.8 * 0 = 100
The value of R(1, 5) keeps changing in the tutorial: from 100 to 0, then back to 100 again. How can that be? The R matrix is static.
Upvotes: 1
Views: 499
Reputation: 43517
I think one mistake is using R(1, 5) in the second equation. If you read the text, you'll find that you're currently in state 3, and you randomly pick state 1 to go to:
For the next episode, we start with a randomly chosen initial state. This time, we have state 3 as our initial state.
Look at the fourth row of matrix R; it has 3 possible actions: go to state 1, 2 or 4. By random selection, we select to go to state 1 as our action.
R(3, 1) is 0, and the updated Q matrix that follows in the article also has the value filled in for Q(3, 1).
Then, the formula should be:
Q(3, 1) = R(3, 1) + 0.8 * Max[Q(1, 3), Q(1, 5)] = 0 + 0.8 * 100 = 80
The entry (1, 2) in R is -1, which marks a move that is not possible, so I think using it in the tutorial's version of the equation is a mistake. The text even says:
Now we imagine that we are in state 1. Look at the second row of reward matrix R (i.e. state 1). It has 2 possible actions: go to state 3 or state 5
So R(1, 5) doesn't change: it's always 100. It's just sometimes confused with R(3, 1).
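If it helps, here is a small Python sketch that checks both updates by hand. It assumes the tutorial's 6x6 reward matrix is as written below (only the rows for states 1, 3 and 5 are confirmed by the passages quoted here; the other rows are my reading of the tutorial), and the helper name q_update is made up for illustration. Note that R is never modified; only Q is.

```python
GAMMA = 0.8  # discount factor used in the tutorial's equations

# Reward matrix R[state][action]; -1 marks a move that is not possible.
# Assumption: rows 0, 2 and 4 are reproduced from the tutorial, not from this thread.
R = [
    [-1, -1, -1, -1,  0,  -1],
    [-1, -1, -1,  0, -1, 100],
    [-1, -1, -1,  0, -1,  -1],
    [-1,  0,  0, -1,  0,  -1],
    [ 0, -1, -1,  0, -1, 100],
    [-1,  0, -1, -1,  0, 100],
]

def q_update(Q, R, state, action, gamma=GAMMA):
    """Return R(state, action) + gamma * max over valid a' of Q(next_state, a')."""
    next_state = action  # in this tutorial, taking an action means moving to that state
    valid_actions = [a for a, r in enumerate(R[next_state]) if r >= 0]
    return R[state][action] + gamma * max(Q[next_state][a] for a in valid_actions)

# Q starts at all zeros and is the only matrix that gets updated.
Q = [[0.0] * 6 for _ in range(6)]

# Episode 1: in state 1, go to state 5.
Q[1][5] = q_update(Q, R, 1, 5)  # R(1, 5) + 0.8 * max(Q(5, 1), Q(5, 4), Q(5, 5))

# Episode 2: in state 3, go to state 1.
Q[3][1] = q_update(Q, R, 3, 1)  # R(3, 1) + 0.8 * max(Q(1, 3), Q(1, 5))

print(Q[1][5], Q[3][1])  # prints: 100.0 80.0
```

The second update looks up R(3, 1), not R(1, 5), and takes the maximum over Q(1, 3) and Q(1, 5) because those are the only valid moves out of state 1.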
Update
Here is another part of the tutorial that I think should be changed for clarity and correctness and what I think it should say, in order. I bolded the changes I made.
The updated entries of matrix Q, Q(5, 1), Q(5, 4), Q(5, 5), are all zero. The result of this computation for Q(1, 5) is 100 because of the instant reward from R(5, 1). This result does not change the Q matrix.
Change to:
The updated entries of matrix Q, Q(5, 1), Q(5, 4), Q(5, 5) (as in, updated from previous operations), are all zero. The result of this computation for Q(1, 5) is 100 because of the instant reward from R(1, 5). This result does not change the Q matrix.
Upvotes: 2