Reputation: 12328
I am looking at tf-agents to learn about reinforcement learning. I am following this tutorial. A different policy, called collect_policy, is used for training than for evaluation (policy).
The tutorial states that there is a difference, but IMO it does not explain why there are two policies, since it does not describe a functional difference between them.
Agents contain two policies:
agent.policy — The main policy that is used for evaluation and deployment.
agent.collect_policy — A second policy that is used for data collection.
I've looked at the source code of the agent. It says
policy: An instance of tf_policy.Base representing the Agent's current policy.
collect_policy: An instance of tf_policy.Base representing the Agent's current data collection policy (used to set self.step_spec).
But I do not see self.step_spec anywhere in the source file. The closest thing I can find is time_step_spec, but that is the first constructor argument of the TFAgent class, so it makes no sense to set it via a collect_policy.
So the only thing I could think of was to put it to the test: I used policy instead of collect_policy for training, and the agent reached the max score in the environment nonetheless.
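For concreteness, the collection step I changed looks roughly like this (a minimal sketch, assuming a DqnAgent named agent and a TFPyEnvironment named env as set up in the tutorial; only the policy driving collection is swapped):

```python
# Minimal sketch of one data-collection step, assuming `agent` is a DqnAgent
# and `env` is a TFPyEnvironment, as in the TF-Agents DQN tutorial.
time_step = env.reset()

# What the tutorial does: collect with the exploratory policy.
action_step = agent.collect_policy.action(time_step)

# What I tried instead: collect with the evaluation/deployment policy.
# action_step = agent.policy.action(time_step)

next_time_step = env.step(action_step.action)
# The resulting trajectory is then written to the replay buffer and used
# for training, exactly as in the tutorial.
```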
So what is the functional difference between the two policies?
Upvotes: 5
Views: 1284
Reputation: 15847
There are some reinforcement learning algorithms, such as Q-learning, that use a policy to behave in (or interact with) the environment to collect experience, which is different from the policy they are trying to learn (sometimes known as the target policy). These algorithms are known as off-policy algorithms. An algorithm that is not off-policy is known as on-policy (i.e. the behaviour policy is the same as the target policy). An example of an on-policy algorithm is SARSA. That's why we have both policy and collect_policy in TF-Agents: in general, the behavioural policy can be different from the target policy (though this may not always be the case).
Why should this be the case? Because during learning and interaction with the environment, you need to explore the environment (i.e. take random actions), while, once you have learned the near-optimal policy, you may not need to explore anymore and can just take the near-optimal action (I say near-optimal rather than optimal because you may not have learned the optimal one).
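As a concrete illustration (not TF-Agents code, just a hypothetical tabular Q-learning sketch), the behaviour policy is typically epsilon-greedy while the target policy used in the update is greedy:

```python
import numpy as np

# Hypothetical tabular Q-learning sketch to illustrate the two policies.
# `Q` is a (num_states, num_actions) table; `epsilon`, `alpha`, `gamma`
# are the usual hyperparameters (all names here are illustrative).

def behaviour_action(Q, state, epsilon, rng):
    # Behaviour (collection) policy: epsilon-greedy, so it keeps exploring.
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[state]))

def q_learning_update(Q, state, action, reward, next_state, alpha, gamma):
    # Target policy: greedy. Taking the max over actions in the next state is
    # what makes Q-learning off-policy -- the update ignores how the action
    # was actually chosen.
    td_target = reward + gamma * np.max(Q[next_state])
    Q[state, action] += alpha * (td_target - Q[state, action])
```

In TF-Agents the same split shows up in the agent's two attributes: for a DQN agent, collect_policy is an epsilon-greedy wrapper around the greedy policy that is exposed as policy.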
Upvotes: 3