ϵ-greedy policy with decreasing rate of exploration

Question

I want to implement an ϵ-greedy action-selection policy in Q-learning. Many people use the following equation for a decreasing rate of exploration:

ϵ = e^(-En)

n = the age of the agent

E = exploitation parameter

But I am not clear on what "n" means. Is it the number of visits to a particular state-action pair, or the number of iterations?

Thanks a lot

Pablo EM · Accepted Answer

There are several valid answers to your question. From a theoretical point of view, Q-learning requires that all state-action pairs are (asymptotically) visited infinitely often in order to converge.

This condition can be satisfied in many ways. In my opinion, it's more common to interpret n simply as the number of time steps, i.e., how many interactions the agent has had with the environment [e.g., Busoniu, 2010, Chapter 2].
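As a minimal sketch of this first interpretation, assuming n is a global counter that the caller increments once per environment interaction (the function names and the default E = 0.001 are illustrative choices, not taken from the cited book):

```python
import math
import random

def epsilon_by_step(n, E=0.001):
    """Exploration rate after n interactions: eps = exp(-E * n)."""
    return math.exp(-E * n)

def select_action(q_values, n, E=0.001, rng=random):
    """Epsilon-greedy choice over the Q-values of the current state.

    q_values: list of Q(s, a) for each action a in the current state.
    n: total number of time steps the agent has taken so far (assumed
       to be tracked by the caller).
    """
    if rng.random() < epsilon_by_step(n, E):
        # Explore: pick an action uniformly at random.
        return rng.randrange(len(q_values))
    # Exploit: pick the action with the highest Q-value.
    return max(range(len(q_values)), key=q_values.__getitem__)
```

With this version every state shares the same ϵ, so a state first reached late in training is explored very little.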

However, in some cases the rate of exploration can be different for each state, in which case n is the number of times the agent has visited state s [e.g., Powell, 2011, Chapter 12].
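The per-state interpretation could look roughly like the sketch below. The class name, the visit-count dictionary, and the default E = 0.1 are assumptions made for illustration; states are assumed to be hashable (e.g., tuples or strings).

```python
import math
import random
from collections import defaultdict

class StateCountEpsilonGreedy:
    """Epsilon-greedy policy where n is the visit count of the current state."""

    def __init__(self, E=0.1, seed=None):
        self.E = E                      # exploitation parameter
        self.visits = defaultdict(int)  # n(s): visits to each state so far
        self.rng = random.Random(seed)

    def epsilon(self, state):
        """Exploration rate for this state: eps = exp(-E * n(s))."""
        return math.exp(-self.E * self.visits[state])

    def select_action(self, state, q_values):
        """Choose an action for `state`, then count the visit."""
        eps = self.epsilon(state)
        self.visits[state] += 1
        if self.rng.random() < eps:
            # Explore: uniform random action.
            return self.rng.randrange(len(q_values))
        # Exploit: greedy action with respect to the Q-values.
        return max(range(len(q_values)), key=q_values.__getitem__)
```

Here a rarely visited state keeps a high ϵ even late in training, while frequently visited states quickly become mostly greedy.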

Both interpretations are equally valid and ensure (together with other conditions) the asymptotic convergence of Q-learning. Which approach is better depends on your particular problem, as does the exact value of E you should use.
