Reputation: 355
I want to implement an ϵ-greedy action-selection policy in Q-learning. Many people here have used the following equation for a decreasing rate of exploration:
ɛ = e^(-En)
n = the age of the agent
E = exploitation parameter
But I am not clear on what this "n" means. Is it the number of visits to a particular state-action pair, or the number of iterations?
Thanks a lot
Upvotes: 2
Views: 923
Reputation: 6689
There are several valid answers to your question. From a theoretical point of view, in order to achieve convergence, Q-learning requires that all state-action pairs are (asymptotically) visited infinitely often.
The previous condition can be achieved in many ways. In my opinion, it's more common to interpret n simply as the number of time steps, i.e., how many interactions the agent has performed with the environment [e.g., Busoniu, 2010, Chapter 2].
However, in some cases the rate of exploration can be different for each state, and therefore n is the number of times the agent has visited state s [e.g., Powell, 2011, Chapter 12].
Both interpretations are equally valid and ensure (together with other conditions) the asymptotic convergence of Q-learning. Which approach is better depends on your particular problem, as does the exact value of E you should use.
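To make the two readings of n concrete, here is a minimal Python sketch. The class names GlobalStepSchedule and PerStateSchedule, the helper epsilon_greedy, and the value E = 0.01 are my own illustrative choices, not taken from the cited books; the only thing carried over from the question is the formula ɛ = e^(-En).

```python
import math
import random
from collections import defaultdict


def epsilon(n, E):
    """Exploration rate from the question: eps = exp(-E * n)."""
    return math.exp(-E * n)


def epsilon_greedy(q_values, eps, rng=random):
    """With probability eps pick a random action, otherwise a greedy one."""
    if rng.random() < eps:
        return rng.randrange(len(q_values))
    # Greedy choice: index of the largest Q-value (first one on ties).
    return max(range(len(q_values)), key=q_values.__getitem__)


class GlobalStepSchedule:
    """Interpretation 1: n is the total number of time steps so far."""

    def __init__(self, E):
        self.E = E
        self.t = 0

    def next_epsilon(self, state):
        # The state is ignored; every interaction ages the agent.
        eps = epsilon(self.t, self.E)
        self.t += 1
        return eps


class PerStateSchedule:
    """Interpretation 2: n is the number of visits to the current state."""

    def __init__(self, E):
        self.E = E
        self.visits = defaultdict(int)

    def next_epsilon(self, state):
        # Each state keeps its own counter, so rarely seen states
        # are still explored at a high rate.
        eps = epsilon(self.visits[state], self.E)
        self.visits[state] += 1
        return eps


if __name__ == "__main__":
    E = 0.01  # illustrative value only; tune for your problem
    schedule = PerStateSchedule(E)
    q_row = [0.0, 0.5, 0.2]  # hypothetical Q-values for one state
    for _ in range(3):
        eps = schedule.next_epsilon("s0")
        print(eps, epsilon_greedy(q_row, eps))
```

Both schedules expose the same next_epsilon(state) call, so you can swap one for the other inside your Q-learning loop and compare them on your own problem. A per state-action counter, as mentioned in the question, would work the same way with (state, action) as the dictionary key.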
Upvotes: 3