Reputation: 119
In the Trust Region Policy Optimisation (TRPO) algorithm (and subsequently in PPO also), I do not understand the motivation behind replacing the log-probability term log πθ(a|s) from standard policy gradients with the importance sampling ratio πθ(a|s) / πθold(a|s), i.e. the probability under the new policy over the probability under the old policy.

Could someone please explain this step to me?
I understand why, once we have made this substitution, we then need to constrain the updates within a 'trust region' (to prevent the ratio πθ/πθold from pushing the updates outside the region in which the approximation of the gradient direction is accurate); I'm just not sure of the reason for including this term in the first place.
Upvotes: 1
Views: 936
Reputation: 5412
The original formulation of the policy gradient objective does not have the log; it is just E[pi * A]. The log is used for numerical stability, since it does not change the optimum.
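To make the connection concrete, here is a minimal sketch (PyTorch assumed; the policy, action and advantage are made-up toy values) checking numerically that the log-probability surrogate and the probability-ratio surrogate have the same gradient at the point where the new policy equals the old one, which is why the swap does not change the update direction locally:

```python
import torch

torch.manual_seed(0)
logits = torch.randn(4, requires_grad=True)   # parameters of a toy categorical policy
action = torch.tensor(2)                      # one sampled action (hypothetical)
advantage = torch.tensor(1.7)                 # its advantage estimate (hypothetical)

dist = torch.distributions.Categorical(logits=logits)
log_prob = dist.log_prob(action)

# Standard policy-gradient surrogate: log pi(a|s) * A
grad_log = torch.autograd.grad(log_prob * advantage, logits, retain_graph=True)[0]

# Importance-sampling surrogate: (pi / pi_old) * A, with pi_old held fixed (detached)
prob_old = log_prob.detach().exp()
ratio = log_prob.exp() / prob_old
grad_ratio = torch.autograd.grad(ratio * advantage, logits)[0]

print(torch.allclose(grad_log, grad_ratio))   # True: same gradient at theta = theta_old
```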
The importance sampling term must be used because you are maximizing pi (the new policy) but you only have samples from the current policy pi_old. So basically what IS does is rewrite the objective

integral pi * A

which you cannot estimate directly, since your samples come only from pi_old, as

integral (pi / pi_old) * pi_old * A = E_pi_old[ (pi / pi_old) * A ]

which is then approximated with samples from pi_old.
This is also useful if you want to store samples collected during previous iterations and still use them to update your policy.
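As a concrete illustration of the rewriting above, here is a small numerical sketch (NumPy assumed; the two policies and the advantage values are made up) that estimates sum_a pi(a) * A(a) using only actions sampled from pi_old:

```python
import numpy as np

rng = np.random.default_rng(0)
pi_old = np.array([0.5, 0.3, 0.2])   # behaviour policy over 3 actions (hypothetical)
pi_new = np.array([0.3, 0.4, 0.3])   # policy being evaluated/improved (hypothetical)
A = np.array([1.0, -0.5, 2.0])       # advantage of each action (hypothetical)

true_value = np.sum(pi_new * A)      # "integral pi * A", computed exactly here

actions = rng.choice(3, size=100_000, p=pi_old)   # samples come only from pi_old
weights = pi_new[actions] / pi_old[actions]       # importance-sampling ratios pi / pi_old
is_estimate = np.mean(weights * A[actions])       # Monte Carlo estimate of sum pi * A

print(true_value, is_estimate)       # the two values should be close
```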
However, this naive importance sampling is usually unstable, especially if your current policy is very different from the previous one. In PPO and TRPO it works well because the policy update is constrained (with a KL-divergence constraint in TRPO, and by clipping the IS ratio in PPO).
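For reference, PPO's clipping is simply a cap on that IS ratio inside the surrogate objective. A short sketch of the clipped surrogate (PyTorch assumed; the function name and the toy numbers are illustrative, not from any library):

```python
import torch

def ppo_clip_loss(log_prob_new, log_prob_old, advantage, clip_eps=0.2):
    """Negative clipped surrogate: ratios outside [1 - eps, 1 + eps] get no extra credit."""
    ratio = torch.exp(log_prob_new - log_prob_old)             # pi / pi_old
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    return -torch.min(unclipped, clipped).mean()               # maximise => minimise the negative

# Tiny usage example with made-up numbers
lp_new = torch.tensor([-0.9, -1.2, -0.3], requires_grad=True)
lp_old = torch.tensor([-1.0, -1.0, -1.0])
adv = torch.tensor([0.5, -1.0, 2.0])
loss = ppo_clip_loss(lp_new, lp_old, adv)
loss.backward()                                                # gradients flow only through lp_new
```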
This is a nice book chapter for understanding importance sampling.
Upvotes: 3
Reputation: 7608
TRPO and PPO keep optimizing the policy without sampling again.
That means that the data used to estimate the gradient has been sampled with a different policy (pi_old). In order to correct for the difference between the sampling policy and the policy that is being optimized, an importance sampling ratio needs to be applied.
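A minimal sketch of that reuse pattern (PyTorch assumed; the advantages and hyperparameters are placeholders): one batch is collected with pi_old, then several optimization steps are taken on that same batch, recomputing the ratio pi / pi_old at every step:

```python
import torch

torch.manual_seed(0)
n_actions, batch = 4, 256
logits = torch.zeros(n_actions, requires_grad=True)            # current policy parameters
opt = torch.optim.Adam([logits], lr=0.05)

with torch.no_grad():                                          # sample once, with pi_old
    old_dist = torch.distributions.Categorical(logits=logits)
    actions = old_dist.sample((batch,))
    log_prob_old = old_dist.log_prob(actions)
advantages = torch.randn(batch)                                # placeholder advantage estimates

for epoch in range(10):                                        # keep optimizing without resampling
    dist = torch.distributions.Categorical(logits=logits)
    ratio = torch.exp(dist.log_prob(actions) - log_prob_old)   # corrects for sampling from pi_old
    loss = -(ratio * advantages).mean()                        # unconstrained IS surrogate
    opt.zero_grad()
    loss.backward()
    opt.step()
```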
Upvotes: 1