PyRsquared

Reputation: 7338

Understanding the total_timesteps parameter in stable-baselines' models

I'm reading through the original PPO paper and trying to match this up to the input parameters of the stable-baselines PPO2 model.

One thing I do not understand is the total_timesteps parameter in the learn method.

The paper mentions

One style of policy gradient implementation... runs the policy for T timesteps (where T is much less than the episode length)

While the stable-baselines documentation describes the total_timesteps parameter as

(int) The total number of samples to train on

Therefore I would think that T in the paper and total_timesteps in the documentation are the same parameter.

What I do not understand is the following:

I'm still learning the terminology behind RL, so I hope I've explained my question clearly above. Any help or tips would be very welcome.

Upvotes: 18

Views: 12724

Answers (1)

Per Arne Andersen

Reputation: 544

According to the stable-baselines source code:

  • total_timesteps is the total number of environment steps the agent will take during the learn call, for any environment. It can span several episodes, meaning the value is not bounded by the episode length.
  • Say your environment's episodes last more than 1000 timesteps and you call learn once with total_timesteps=1000: you would only experience the first 1000 frames, and the rest of the episode remains unseen. In many experiments you know how many timesteps an episode lasts (e.g. CartPole), but for environments of unknown length this becomes less useful. BUT: if you call the learn function twice and the episode had 1500 frames, you would see one full episode plus 50% of the second (see the sketch after this list).
  • An episode lasts until the terminal (done) flag is set to true (in Gym this is often also triggered after a maximum number of timesteps). Many other RL implementations use total_episodes instead, so you do not have to think about timestep bookkeeping, but the downside is that an episode can end after only a handful of steps if you hit an absorbing state early.
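
To make the episode-spanning behaviour concrete, here is a minimal sketch. The environment and the 1000-step budget are stand-ins for the answer's example (CartPole is used only because it is readily available), and the comments follow the answer's reasoning about rollouts continuing across calls:

import gym

from stable_baselines import PPO2

# Stand-in environment; imagine its episodes lasting ~1500 frames,
# as in the answer's example.
env = gym.make("CartPole-v1")
model = PPO2("MlpPolicy", env, n_steps=32)

# One call: the agent collects 1000 frames of experience, which may
# cover only part of a long episode.
model.learn(total_timesteps=1000)

# A second call adds another 1000 frames; following the answer's
# reasoning, across both calls the agent would see the full 1500-frame
# episode plus roughly half of the next one.
model.learn(total_timesteps=1000)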

The total_timesteps argument also interacts with n_steps: the number of policy updates is calculated as follows:

n_updates = total_timesteps // self.n_batch

where n_batch is n_steps times the number of vectorised environments.

This means that if you have 1 environment running with n_steps set to 32 and total_timesteps = 25000, you would do 781 updates to your policy during the learn call (not counting epochs, since PPO can perform several gradient passes over a single batch).
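
As a quick sanity check, the arithmetic for that example (values taken from the paragraph above) looks like this:

n_envs = 1                              # single (non-vectorised) environment
n_steps = 32
total_timesteps = 25000

n_batch = n_steps * n_envs              # samples collected per rollout
n_updates = total_timesteps // n_batch  # 25000 // 32 == 781

print(n_updates)                        # -> 781 policy updates during learn()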

The lesson is:

  • For environments of unknown episode length, you will have to experiment with this value. One option is to keep a running average of the episode length and derive total_timesteps from it (see the sketch after this list).
  • Where the episode length is known, set it according to the number of episodes you would like to train for (episode length × number of episodes). Note that episodes may be shorter than that, because the agent might not (and probably won't) reach the maximum number of steps every time.
  • TL;DR: play with the value (treat it as a hyperparameter).
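
A rough sketch of the running-average idea from the first bullet above. The random-action probe and the 10-episode sample size are hypothetical choices, not anything prescribed by stable-baselines:

import gym
import numpy as np

from stable_baselines import PPO2

env = gym.make("CartPole-v1")

# Probe a handful of episodes with random actions to get a crude
# estimate of the average episode length.
lengths = []
for _ in range(10):
    env.reset()
    done, steps = False, 0
    while not done:
        _, _, done, _ = env.step(env.action_space.sample())
        steps += 1
    lengths.append(steps)

avg_episode_length = int(np.mean(lengths))
desired_episodes = 200                  # hypothetical training budget

model = PPO2("MlpPolicy", env, n_steps=32)
model.learn(total_timesteps=avg_episode_length * desired_episodes)

Keep in mind that episode length usually changes as the policy improves, so an estimate like this is only a starting point, which is another reason to treat total_timesteps as a hyperparameter.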

Hope this helps!

Upvotes: 20
