Reputation: 1523
I use n-step Sarsa, and sometimes Sarsa(lambda).
After experimenting with different epsilon schedules, I found that the agent learns faster when I vary epsilon within an episode, based on the number of steps already taken and the mean length of the last 10 episodes:
Few steps taken / beginning of episode => low epsilon
Many steps taken / end of episode => high epsilon
This works far better than a simple epsilon decay from episode to episode.
Does the theory allow this?
I think it does, because all states are still visited regularly.
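As a rough sketch of the schedule described above (the function name, bounds, and linear interpolation are my own assumptions, not the exact implementation used):

```python
def epsilon_for_step(steps_taken, recent_episode_lengths,
                     eps_min=0.01, eps_max=0.3):
    """Hypothetical within-episode schedule: epsilon rises with the
    fraction of the mean recent episode length already elapsed."""
    mean_len = sum(recent_episode_lengths) / len(recent_episode_lengths)
    progress = min(steps_taken / mean_len, 1.0)  # 0 at episode start, 1 at/after mean length
    return eps_min + (eps_max - eps_min) * progress

# Early in the episode -> low epsilon; near the mean length -> high epsilon.
print(epsilon_for_step(0, [100] * 10))    # eps_min
print(epsilon_for_step(100, [100] * 10))  # close to eps_max
```
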
Upvotes: 2
Views: 1341
Reputation: 6689
Yes, the SARSA algorithm converges even if you update the epsilon parameter within each episode. The requirement is that epsilon eventually tends to zero (or to a small value).
In your case, since you start each episode with a small epsilon and increase it as the number of steps grows, it's not clear to me that your algorithm will converge towards an optimal policy. I mean, at some point epsilon should decrease.
The "best" epsilon schedule is highly problem-dependent, and there is no single schedule that works well for all problems. So, in the end, some experience with the problem, and probably some trial-and-error adjustment, is required.
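One way to reconcile a within-episode increase with the requirement that epsilon eventually tends to a small value is to decay the per-episode ceiling across episodes. This is only a sketch under my own assumptions (names, constants, and the geometric decay are made up):

```python
def scheduled_epsilon(episode, steps_taken, mean_episode_length,
                      eps_min=0.01, eps_max0=0.5, decay=0.99):
    """Hypothetical schedule: epsilon rises within an episode, but its
    ceiling decays geometrically across episodes, so the whole schedule
    still tends toward eps_min in the limit."""
    eps_max = eps_min + (eps_max0 - eps_min) * decay ** episode  # shrinking ceiling
    progress = min(steps_taken / max(mean_episode_length, 1.0), 1.0)
    return eps_min + (eps_max - eps_min) * progress

# Late episodes explore less, even at their end.
print(scheduled_epsilon(0, 100, 100))     # close to eps_max0
print(scheduled_epsilon(1000, 100, 100))  # close to eps_min
```
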
Upvotes: 3