pestopasta

Reputation: 120

How does exploration work in OpenAI Baselines?

I'm starting to play around with https://github.com/openai/baselines/, specifically the deepq algorithm. I wanted to do my own analysis of the parameters passed into the deepq.learn method.

The method has two parameters related to exploration - exploration_fraction and exploration_final_eps.

The way I understand it - exploration_fraction determines how much of the training time the algorithm spends exploring, and exploration_final_eps drives the probability of taking a random action each time it explores. So - the number of random actions taken for the sake of exploring is a product of exploration_fraction and exploration_final_eps. Is that correct?

Can someone provide an explanation (in layman's terms) of how the algorithm explores, based on these two parameters?

Upvotes: 0

Views: 829

Answers (1)

Pablo EM

Reputation: 6689

Your understanding is almost correct. The probability p of taking a random action (i.e., an exploratory action) is a number that often starts high and decreases over time. This makes sense because at the beginning of training the learned policy is still poor, so random actions cost little, but the policy gets better as learning progresses.

Taking that into account, exploration_fraction and exploration_final_eps are the parameters that control how probability p decreases over time. When you explore the code in the repo, you find the following lines:

# Create the schedule for exploration starting from 1.
exploration = LinearSchedule(schedule_timesteps=int(exploration_fraction * total_timesteps),
                             initial_p=1.0,
                             final_p=exploration_final_eps)

Here it's easier to understand the meaning of exploration_fraction and exploration_final_eps:

  • exploration_fraction determines for how long p keeps decreasing, expressed as a fraction of total_timesteps. Notice that in this case p=1 initially, but this initial value may vary.
  • exploration_final_eps determines the minimum value of p. Once the period indicated by exploration_fraction is over, p remains fixed at exploration_final_eps.
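To make the two bullets concrete, here is a minimal sketch of a linearly decaying p. It is not the code from the repo: the function name linear_epsilon and the example values (100,000 timesteps, exploration_fraction=0.1, exploration_final_eps=0.02) are assumptions chosen for illustration, and it assumes exploration_fraction * total_timesteps is at least 1.

```python
def linear_epsilon(t, total_timesteps, exploration_fraction,
                   exploration_final_eps, initial_p=1.0):
    """Probability of a random action at timestep t under a linear decay.

    Illustrative helper (hypothetical name), not part of baselines.
    """
    # Number of timesteps over which p is allowed to decrease.
    schedule_timesteps = int(exploration_fraction * total_timesteps)
    # Share of the decay period already completed, capped at 1.0 afterwards.
    fraction = min(t / schedule_timesteps, 1.0)
    # Interpolate from initial_p down to exploration_final_eps.
    return initial_p + fraction * (exploration_final_eps - initial_p)


if __name__ == "__main__":
    # Assumed example settings: decay over the first 10% of 100,000 steps.
    for t in (0, 5_000, 10_000, 50_000):
        print(t, linear_epsilon(t, 100_000, 0.1, 0.02))
```

With these assumed settings p starts at 1.0, is roughly 0.51 halfway through the decay period, and stays at about 0.02 from timestep 10,000 onward.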

Sometimes p decreases linearly, such as in the case of LinearSchedule, but other ways are also possible.
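Coming back to the question about how many random actions are taken: with a linear decay, the expected count is the sum of p over all timesteps, not the product of the two parameters. The sketch below estimates it under assumed settings; expected_random_actions is a hypothetical helper written for this answer, not a function in baselines.

```python
def expected_random_actions(total_timesteps, exploration_fraction,
                            exploration_final_eps, initial_p=1.0):
    """Expected number of random actions under a linear decay of p.

    Illustrative helper (hypothetical name); assumes the decay period
    int(exploration_fraction * total_timesteps) is at least 1.
    """
    schedule_timesteps = int(exploration_fraction * total_timesteps)
    total = 0.0
    for t in range(total_timesteps):
        # p at timestep t: linear interpolation, then constant at the final value.
        fraction = min(t / schedule_timesteps, 1.0)
        total += initial_p + fraction * (exploration_final_eps - initial_p)
    return total


if __name__ == "__main__":
    # Assumed example: 100,000 steps, decay over the first 10%, final p of 0.02.
    print(expected_random_actions(100_000, 0.1, 0.02))
```

For these assumed values the result is about 6,900 random actions: roughly 5,100 during the decay period plus 1,800 from the remaining 90,000 timesteps at p=0.02.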

Upvotes: 3
