I'm starting to play around with https://github.com/openai/baselines/, specifically the deepq algorithm. I wanted to do my own analysis of the parameters passed into the deepq.learn method.
The method has two parameters related to exploration: exploration_fraction and exploration_final_eps.
The way I understand it, exploration_fraction determines how much of the training time the algorithm spends exploring, and exploration_final_eps drives the probability of taking a random action each time it explores. So the number of random actions taken for the sake of exploring is a product of exploration_fraction and exploration_final_eps. Is that correct?
Can someone provide an explanation (in layman terms) of how the algorithm explores, based on these two parameters?
Upvotes: 0
Your understanding is almost correct. The probability p of taking a random action (i.e., an exploratory action) is a number that typically starts high and decreases over time. This makes sense because at the beginning of training the policy being learned is still useless, but it gets better as learning progresses.
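To make "taking a random action with probability p" concrete, here is a sketch of the usual epsilon-greedy rule; the function name and arguments are illustrative, not baselines' actual API:

```python
import random

def epsilon_greedy(p, greedy_action, num_actions, rng=random):
    # Illustrative sketch, not baselines' code.
    # With probability p, explore: pick a uniformly random action.
    if rng.random() < p:
        return rng.randrange(num_actions)
    # Otherwise, exploit: follow the current greedy policy.
    return greedy_action
```

With p = 1 every action is random; with p = 0 the agent always follows its greedy policy.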
Taking that into account, exploration_fraction and exploration_final_eps are the parameters that control how the probability p decreases over time. When you explore the code in the repo, you find the following lines:
# Create the schedule for exploration starting from 1.
exploration = LinearSchedule(schedule_timesteps=int(exploration_fraction * total_timesteps),
initial_p=1.0,
final_p=exploration_final_eps)
Here it's easier to understand the meaning of exploration_fraction and exploration_final_eps:
- exploration_fraction determines for how long (in timesteps) p decreases. Notice that in this case p = 1 initially, but this initial value may vary.
- exploration_final_eps determines the minimum value of p. Once the probability has decreased during the period indicated by exploration_fraction, p remains fixed at a value equal to exploration_final_eps.

Sometimes p decreases linearly, as in the case of LinearSchedule, but other schedules are also possible.
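As a concrete illustration, the linearly decaying p can be computed like this; a simplified sketch of the interpolation, not the library's actual LinearSchedule implementation:

```python
def linear_p(t, total_timesteps, exploration_fraction, exploration_final_eps,
             initial_p=1.0):
    # Sketch of a linear decay, not baselines' actual LinearSchedule code.
    # Length of the decay phase, as in the snippet above.
    schedule_timesteps = int(exploration_fraction * total_timesteps)
    # Fraction of the decay phase completed, capped at 1.0.
    fraction = min(float(t) / schedule_timesteps, 1.0)
    # Linear interpolation from initial_p down to exploration_final_eps.
    return initial_p + fraction * (exploration_final_eps - initial_p)
```

For example, with total_timesteps=100000, exploration_fraction=0.1 and exploration_final_eps=0.02, p decays linearly from 1.0 to 0.02 over the first 10,000 timesteps and stays at 0.02 afterwards.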
Upvotes: 3