PyRsquared

Reputation: 7338

Understanding the total_timesteps parameter in stable-baselines' models

I'm reading through the original PPO paper and trying to match this up to the input parameters of the stable-baselines PPO2 model.

One thing I do not understand is the total_timesteps parameter in the learn method.

The paper mentions

One style of policy gradient implementation... runs the policy for T timesteps (where T is much less than the episode length)

While the stable-baselines documentation describes the total_timesteps parameter as

(int) The total number of samples to train on

Therefore I would think that T in the paper and total_timesteps in the documentation are the same parameter.

What I do not understand is the following:

I'm still learning the terminology behind RL, so I hope I've explained my question clearly above. Any help or tips would be very welcome.

Upvotes: 18

Views: 12724

Answers (1)

Per Arne Andersen

Reputation: 544

According to the stable-baselines source code:

  • total_timesteps is the total number of environment steps the agent will take during the learn call, for any environment. It can span several episodes, meaning the value is not bounded by the episode length.
  • Say your environment's episodes last more than 1000 timesteps and you call learn once with total_timesteps=1000: you would only experience the first 1000 frames, and the rest of the episode remains unseen. In many experiments you know how many timesteps an episode lasts (e.g. CartPole), but for environments of unknown length this becomes less useful. BUT: if you call the learn function twice and the episode had 1500 frames, you would see one full episode plus 50% of the second (see the sketch after this list).
  • An episode lasts until the terminal (done) flag is set to true (in Gym this is often also triggered after a maximum number of timesteps). Many other RL implementations use total_episodes instead, so you do not have to think about timestep bookkeeping, but the downside is that an episode can end after only a handful of steps if you hit an absorbing state early.
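
To make the episode-spanning behaviour concrete, here is a minimal sketch. The environment and the 1000-step budget are stand-ins for the answer's example (CartPole is used only because it is readily available), and the comments follow the answer's reasoning about rollouts continuing across calls:

import gym

from stable_baselines import PPO2

# Stand-in environment; imagine its episodes lasting ~1500 frames,
# as in the answer's example.
env = gym.make("CartPole-v1")
model = PPO2("MlpPolicy", env, n_steps=32)

# One call: the agent collects 1000 frames of experience, which may
# cover only part of a long episode.
model.learn(total_timesteps=1000)

# A second call adds another 1000 frames; following the answer's
# reasoning, across both calls the agent would see the full 1500-frame
# episode plus roughly half of the next one.
model.learn(total_timesteps=1000)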

The total_timesteps argument also interacts with n_steps: the number of policy updates is calculated as follows:

n_updates = total_timesteps // self.n_batch

where n_batch is n_steps times the number of vectorised environments.

This means that if you have 1 environment running with n_steps set to 32 and total_timesteps = 25000, you would do 781 updates to your policy during the learn call (not counting epochs, since PPO can perform several gradient passes over a single batch).
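
As a quick sanity check, the arithmetic for that example (values taken from the paragraph above) looks like this:

n_envs = 1                              # single (non-vectorised) environment
n_steps = 32
total_timesteps = 25000

n_batch = n_steps * n_envs              # samples collected per rollout
n_updates = total_timesteps // n_batch  # 25000 // 32 == 781

print(n_updates)                        # -> 781 policy updates during learn()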

The lesson is:

  • For environments of unknown episode length, you will have to experiment with this value. One option is to keep a running average of the episode length and derive total_timesteps from it (see the sketch after this list).
  • Where the episode length is known, set it according to the number of episodes you would like to train for (episode length × number of episodes). Note that episodes may be shorter than that, because the agent might not (and probably won't) reach the maximum number of steps every time.
  • TL;DR: play with the value (treat it as a hyperparameter).
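
A rough sketch of the running-average idea from the first bullet above. The random-action probe and the 10-episode sample size are hypothetical choices, not anything prescribed by stable-baselines:

import gym
import numpy as np

from stable_baselines import PPO2

env = gym.make("CartPole-v1")

# Probe a handful of episodes with random actions to get a crude
# estimate of the average episode length.
lengths = []
for _ in range(10):
    env.reset()
    done, steps = False, 0
    while not done:
        _, _, done, _ = env.step(env.action_space.sample())
        steps += 1
    lengths.append(steps)

avg_episode_length = int(np.mean(lengths))
desired_episodes = 200                  # hypothetical training budget

model = PPO2("MlpPolicy", env, n_steps=32)
model.learn(total_timesteps=avg_episode_length * desired_episodes)

Keep in mind that episode length usually changes as the policy improves, so an estimate like this is only a starting point, which is another reason to treat total_timesteps as a hyperparameter.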

Hope this helps!

Upvotes: 20
