Reputation: 66
I've read on Wikipedia (https://en.wikipedia.org/wiki/Q-learning):
Q-learning may suffer from slow rate of convergence, especially when the discount factor gamma is close to one.[16] Speedy Q-learning, a new variant of Q-learning algorithm, deals with this problem and achieves a slightly better rate of convergence than model-based methods such as value iteration.
So I wanted to try speedy Q-learning and see how much better it is.
The only source about it I could find on the internet is this: https://papers.nips.cc/paper/4251-speedy-q-learning.pdf
That's the algorithm they suggest.
Now, I don't understand it. What exactly is T_kQ_k? Am I supposed to have another list of Q-values? Is there any clearer explanation than this?
Q[previousState][action] = Q[previousState][action] + learningRate * (reward + discountFactor * maxNextExpectedReward - Q[previousState][action]);
This is my current Q-learning update; I want to replace it with speedy Q-learning.
Upvotes: 0
Views: 1119
Reputation: 6689
A first consideration: if you are trying to speed up Q-learning for a practical problem, I would choose other options before Speedy Q-learning, such as the well-known Q(lambda), i.e., Q-learning combined with eligibility traces. Why? Because there is a lot of information and good experimental results with eligibility traces. In fact, as the Speedy Q-learning authors suggest, the working principles of both methods are similar:
The idea of using previous estimates of the action-values has already been used to improve the performance of Q-learning. A popular algorithm of this kind is Q(lambda) [14, 20], which incorporates the concept of eligibility traces in Q-learning, and has been empirically shown to have a better performance than Q-learning, i.e., Q(0), for suitable values of lambda.
You can find a nice introduction to eligibility traces in the Sutton and Barto RL book. If you are simply interested in studying the differences between Speedy Q-learning and the standard version, read on.
And now to your question. Yes, you have to maintain two separate lists of Q-values, one for the current time step k and another for the previous step k-1, namely Q_{k} and Q_{k-1} respectively.
In the common case (including your case), TQ_{k} = r(x,a) + discountFactor * max_{b in A} Q_{k}(y,b), where y is the next state and b is the action that maximizes Q_{k} for that state. Notice that you are already using that operator in standard Q-learning, which has the following update rule:
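Written out in the same notation, it is the textbook rule, equivalent to the line of code you posted:

Q_{k+1}(x,a) = Q_{k}(x,a) + learningRate * (TQ_{k}(x,a) - Q_{k}(x,a))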
In the case of Speedy Q-learning (SQL), as previously stated, you maintain two Q-functions and apply the operator T to both, obtaining TQ_{k} and TQ_{k-1}. The results of these two operations are then used in the SQL update rule:
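If I read the pseudocode in the linked paper correctly, with alpha_k = 1/(k+1), that rule is:

Q_{k+1}(x,a) = Q_{k}(x,a) + alpha_k * (TQ_{k-1}(x,a) - Q_{k}(x,a)) + (1 - alpha_k) * (TQ_{k}(x,a) - TQ_{k-1}(x,a))

Intuitively, the last term follows the change between the two Bellman backups with the large step size (1 - alpha_k), which is where the speed-up comes from.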
Another point to highlight about the pseudo-code you posted in your question is that it corresponds to the synchronous version of SQL. This means that, at each time step k, you need to generate the next state y and update Q_{k+1}(x,a) for all existing state-action pairs (x,a).
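To make the difference with your one-line update concrete, here is a minimal sketch in the style of your snippet (Java assumed). It applies the SQL rule per observed transition, which is my own asynchronous simplification of the paper's synchronous pseudocode; the class name, the per-pair shifting of Q_{k-1}, and the per-update step counter are assumptions, not the paper's exact procedure:

public class SpeedyQLearning {
    // Two Q-tables: Q_{k-1} (previous) and Q_{k} (current), as discussed above.
    private final double[][] qPrev;
    private final double[][] qCurr;
    private final double discountFactor;
    private long k = 0;  // step counter, used for alpha_k = 1 / (k + 1)

    public SpeedyQLearning(int numStates, int numActions, double discountFactor) {
        this.qPrev = new double[numStates][numActions];
        this.qCurr = new double[numStates][numActions];
        this.discountFactor = discountFactor;
    }

    // One SQL update for an observed transition (state, action, reward, nextState).
    public void update(int state, int action, double reward, int nextState) {
        double alpha = 1.0 / (k + 1);

        // Empirical Bellman operator applied to both Q-functions: TQ_{k-1} and TQ_{k}.
        double tqPrev = reward + discountFactor * max(qPrev[nextState]);
        double tqCurr = reward + discountFactor * max(qCurr[nextState]);

        double updated = qCurr[state][action]
                + alpha * (tqPrev - qCurr[state][action])
                + (1.0 - alpha) * (tqCurr - tqPrev);

        // Shift Q_{k} into Q_{k-1} for this pair before storing the new value
        // (a per-pair simplification; the paper shifts the whole table per iteration).
        qPrev[state][action] = qCurr[state][action];
        qCurr[state][action] = updated;
        k++;
    }

    private static double max(double[] values) {
        double best = values[0];
        for (double v : values) {
            best = Math.max(best, v);
        }
        return best;
    }
}

Your greedy action selection can keep reading from qCurr, exactly as it currently reads from Q.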
Upvotes: 1