Reputation: 7338
I'm reading through the original PPO paper and trying to match this up to the input parameters of the stable-baselines PPO2 model.
One thing I do not understand is the total_timesteps parameter in the learn method.
The paper mentions:
"One style of policy gradient implementation... runs the policy for T timesteps (where T is much less than the episode length)"
The stable-baselines documentation, meanwhile, describes the total_timesteps parameter as:
"(int) The total number of samples to train on"
Therefore I would think that T in the paper and total_timesteps in the documentation are the same parameter.
What I do not understand is the following:
Does total_timesteps always need to be less than or equal to the total number of available "frames" (samples) in an environment (say, if I had a finite number of frames like 1,000,000)? If so, why?
By setting total_timesteps to a number less than the number of available frames, what portion of the training data does the agent see? For example, if total_timesteps=1000, does the agent only ever see the first 1000 frames?
Is an episode defined as the total number of available frames, or is it defined as when the agent first "loses" / "dies"? If the latter, then how can you know in advance when the agent will die, to be able to set total_timesteps to a lesser value?
I'm still learning the terminology behind RL, so I hope I've been able to explain my question clearly above. Any help / tips would be very much welcomed.
Upvotes: 18
Views: 12724
Reputation: 544
According to the stable-baselines source code, the total_timesteps argument is used together with n_steps, and the number of updates is calculated as follows:
n_updates = total_timesteps // self.n_batch
where n_batch is n_steps times the number of vectorised environments.
This means that if you were to have 1 environment running with n_steps set to 32 and total_timesteps = 25000, you would do 781 updates to your policy during the learn call (not counting epochs, as PPO can do several gradient updates on a single batch).
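To make the arithmetic concrete, here is a minimal sketch of that calculation in plain Python. It does not import stable-baselines; the helper name ppo2_update_count is hypothetical and only mirrors the formula quoted above, under the assumption that n_batch is n_steps times the number of environments.

```python
def ppo2_update_count(total_timesteps, n_steps, n_envs=1):
    """Hypothetical helper mirroring n_updates = total_timesteps // n_batch."""
    # n_batch: samples collected per rollout across all vectorised environments
    n_batch = n_steps * n_envs
    # Integer division: a partial final batch is never collected
    return total_timesteps // n_batch


n_steps, n_envs, total_timesteps = 32, 1, 25000
n_updates = ppo2_update_count(total_timesteps, n_steps, n_envs)

# Samples actually consumed; the remainder is dropped by the integer division
timesteps_used = n_updates * n_steps * n_envs

print(n_updates)                         # 781
print(timesteps_used)                    # 24992
print(total_timesteps - timesteps_used)  # 8
```

Under this reading, n_steps plays the role of the rollout length per update, while total_timesteps only bounds how many such rollouts are collected in total.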
The lesson is:
Hope this helps!
Upvotes: 20