menta man

Reputation: 66

Speedy Q-Learning

I've read this on Wikipedia (https://en.wikipedia.org/wiki/Q-learning):

Q-learning may suffer from slow rate of convergence, especially when the discount factor γ is close to one.[16] Speedy Q-learning, a new variant of Q-learning algorithm, deals with this problem and achieves a slightly better rate of convergence than model-based methods such as value iteration.

So I wanted to try Speedy Q-learning and see how much better it is.

The only source about it I could find on the internet is this: https://papers.nips.cc/paper/4251-speedy-q-learning.pdf

That's the algorithm they suggest.

[image: pseudo-code of the Speedy Q-learning algorithm from the paper]

Now, I didn't understand it. What exactly is T_k Q_k? Am I supposed to have another list of Q-values? Is there any clearer explanation than this?

    Q[previousState][action] = Q[previousState][action] + learningRate * (reward + discountFactor * maxNextExpectedReward - Q[previousState][action]);

This is my current Q-learning update; I want to replace it with Speedy Q-learning.

Upvotes: 0

Views: 1119

Answers (1)

Pablo EM

Reputation: 6689

A first consideration: if you are trying to speed up Q-learning for a practical problem, I would choose other options before Speedy Q-learning, such as the well-known Q(lambda), i.e., Q-learning combined with eligibility traces. Why? Because there is a lot of information and good experimental results for eligibility traces. In fact, as the Speedy Q-learning authors suggest, the working principles of both methods are similar:

The idea of using previous estimates of the action-values has already been used to improve the performance of Q-learning. A popular algorithm of this kind is Q(lambda) [14, 20], which incorporates the concept of eligibility traces in Q-learning, and has been empirically shown to have a better performance than Q-learning, i.e., Q(0), for suitable values of lambda.
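For concreteness, here is a rough sketch of a single tabular Watkins's Q(lambda) step, following the textbook description (the method and variable names are mine, and you would drop something like this into your own agent class):

    // One Watkins's Q(lambda) step (tabular, accumulating traces). q and e are
    // |S| x |A| tables, (s, a) is the visited state-action pair, r the observed
    // reward, sNext the next state and aNext the action actually selected for it
    // (e.g., epsilon-greedily).
    static void qLambdaStep(double[][] q, double[][] e,
                            int s, int a, double r, int sNext, int aNext,
                            double alpha, double gamma, double lambda) {
        // Greedy action in the next state (break ties toward aNext).
        int aStar = 0;
        for (int b = 1; b < q[sNext].length; b++) {
            if (q[sNext][b] > q[sNext][aStar]) aStar = b;
        }
        if (q[sNext][aNext] == q[sNext][aStar]) aStar = aNext;

        double delta = r + gamma * q[sNext][aStar] - q[s][a];
        e[s][a] += 1.0; // accumulate the trace for the visited pair

        for (int x = 0; x < q.length; x++) {
            for (int b = 0; b < q[x].length; b++) {
                q[x][b] += alpha * delta * e[x][b];
                // Traces decay while the greedy action is followed and are cut
                // whenever an exploratory action is selected.
                e[x][b] = (aNext == aStar) ? gamma * lambda * e[x][b] : 0.0;
            }
        }
    }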

You can find a nice introduction in the Sutton and Barto RL book. If you are simply interested in studying the differences between Speedy Q-learning and the standard version, read on.

And now to your question. Yes, you have to maintain two separate lists of Q-values, one for the current time step k and another for the previous step k-1, namely Q_{k} and Q_{k-1}, respectively.

In the common case (including your case), TQ_{k} = r(x,a) + discountFactor * max_{b in A} Q_{k}(y,b), where y is the next state and b is the action that maximizes Q_{k} for that state. Notice that you are already using this operator in standard Q-learning, which has the following update rule:

[image: standard Q-learning update rule]
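In code, applying that operator to a given transition is just computing the bootstrapped target you already use. A minimal sketch (applyBellman is my name for it, not the paper's):

    // Empirical Bellman operator applied to a Q-table for one sampled transition:
    // TQ(x, a) = r(x, a) + discountFactor * max_b Q(y, b), where y is the next state.
    static double applyBellman(double[][] q, int nextState,
                               double reward, double discountFactor) {
        double maxNext = q[nextState][0];
        for (int b = 1; b < q[nextState].length; b++) {
            maxNext = Math.max(maxNext, q[nextState][b]);
        }
        return reward + discountFactor * maxNext;
    }

With that helper (and assuming your maxNextExpectedReward is max_b Q[nextState][b]), your current update is equivalent to Q[previousState][action] += learningRate * (applyBellman(Q, nextState, reward, discountFactor) - Q[previousState][action]);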

In the case of Speedy Q-learning (SQL), as previously stated, you maintain two Q-functions and apply the operator T to both of them, obtaining TQ_{k} and TQ_{k-1}. The results of these two operations are then used in the SQL update rule:

[image: Speedy Q-learning update rule]
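As I read the update rule in the paper (with learning rate alpha_k = 1/(k+1)), one SQL update for a single pair (x, a) could look like the following sketch. Here qPrev, qCurr and qNext are my names for Q_{k-1}, Q_{k} and Q_{k+1}, and it reuses the applyBellman helper from above:

    // One Speedy Q-learning update for the observed pair (x, a), with sampled
    // next state y and reward r. Reads from qPrev = Q_{k-1} and qCurr = Q_{k},
    // writes the result into qNext = Q_{k+1}.
    static void speedyUpdate(double[][] qPrev, double[][] qCurr, double[][] qNext,
                             int x, int a, int y, double r,
                             double discountFactor, int k) {
        double alpha = 1.0 / (k + 1);                               // alpha_k
        double tqPrev = applyBellman(qPrev, y, r, discountFactor);  // TQ_{k-1}(x, a)
        double tqCurr = applyBellman(qCurr, y, r, discountFactor);  // TQ_{k}(x, a)
        qNext[x][a] = qCurr[x][a]
                + alpha * (tqPrev - qCurr[x][a])
                + (1.0 - alpha) * (tqCurr - tqPrev);
    }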

Another point to highlight about the pseudo-code you post in your question is that it corresponds to the synchronous version of SQL. This means that, at each time step k, you need to generate the next state y and update Q_{k+1}(x,a) for all existing state-action pairs (x,a).
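Putting the pieces together, one synchronous sweep could look roughly like this; sampleNextState and sampleReward are stand-ins for whatever generative model or simulator you have, they are not part of the paper's pseudo-code:

    // One synchronous SQL iteration k: for every state-action pair, sample a next
    // state and reward, apply the update, then shift the tables so that the
    // current Q_{k} becomes the previous Q_{k-1} for the next iteration.
    static void synchronousSweep(double[][] qPrev, double[][] qCurr, double[][] qNext,
                                 int k, double discountFactor) {
        for (int x = 0; x < qCurr.length; x++) {
            for (int a = 0; a < qCurr[x].length; a++) {
                int y = sampleNextState(x, a);   // hypothetical simulator call
                double r = sampleReward(x, a);   // hypothetical simulator call
                speedyUpdate(qPrev, qCurr, qNext, x, a, y, r, discountFactor, k);
            }
        }
        // Rotate the tables: Q_{k-1} <- Q_{k}, Q_{k} <- Q_{k+1}.
        for (int x = 0; x < qCurr.length; x++) {
            System.arraycopy(qCurr[x], 0, qPrev[x], 0, qCurr[x].length);
            System.arraycopy(qNext[x], 0, qCurr[x], 0, qNext[x].length);
        }
    }

If you adapt it to an online setting you would instead update only the pair (x, a) you actually visit at each step, closer to your current code, but keep in mind that the pseudo-code above is the synchronous version.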

Upvotes: 1
