Mike de Klerk

Reputation: 12328

What is the difference between the `policy` and `collect_policy` of a tf-agent?

I am looking at TF-Agents to learn about reinforcement learning, and I am following this tutorial. The tutorial uses a different policy for training, called `collect_policy`, than for evaluation (`policy`).

The tutorial states that there is a difference, but IMO it does not explain why there are two policies, since it does not describe a functional difference between them:

Agents contain two policies:

agent.policy — The main policy that is used for evaluation and deployment.

agent.collect_policy — A second policy that is used for data collection.

I've looked at the source code of the agent. It says

policy: An instance of tf_policy.Base representing the Agent's current policy.

collect_policy: An instance of tf_policy.Base representing the Agent's current data collection policy (used to set self.step_spec).

But I do not see `self.step_spec` anywhere in the source file. The closest thing I can find is `time_step_spec`, but that is the first constructor argument of the `TFAgent` class, so it makes no sense for it to be set via a `collect_policy`.

So the only thing I could think of was to put it to the test: I used `policy` instead of `collect_policy` for training, and the agent reached the maximum score in the environment nonetheless.
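Roughly, the swap looked like this (a minimal sketch based on the tutorial's data-collection pattern; the `collect_step` helper and variable names here are illustrative, not the tutorial's exact code):

```python
# Illustrative per-step collection function in the style of the TF-Agents DQN tutorial.
from tf_agents.trajectories import trajectory

def collect_step(environment, policy, replay_buffer):
    time_step = environment.current_time_step()
    action_step = policy.action(time_step)
    next_time_step = environment.step(action_step.action)
    traj = trajectory.from_transition(time_step, action_step, next_time_step)
    replay_buffer.add_batch(traj)

# The tutorial collects experience with the exploratory policy:
#   collect_step(train_env, agent.collect_policy, replay_buffer)
# For my test I collected with the evaluation policy instead:
#   collect_step(train_env, agent.policy, replay_buffer)
```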

So what is the functional difference between the two policies?

Upvotes: 5

Views: 1284

Answers (1)

nbro

Reputation: 15847

Some reinforcement learning algorithms, such as Q-learning, use one policy to behave in (or interact with) the environment and collect experience, which is different from the policy they are trying to learn (sometimes known as the target policy). These algorithms are known as off-policy algorithms. An algorithm that is not off-policy is known as on-policy (i.e. the behaviour policy is the same as the target policy); an example of an on-policy algorithm is SARSA. That's why we have both `policy` and `collect_policy` in TF-Agents: in general, the behaviour policy can be different from the target policy (though this may not always be the case).
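You can see this concretely in TF-Agents. For a `DqnAgent`, for instance, the two attributes are typically different policy objects: `policy` is the greedy (target) policy over the Q-network, while `collect_policy` wraps it with epsilon-greedy exploration. A minimal sketch (the CartPole setup here is just for illustration):

```python
import tensorflow as tf
from tf_agents.agents.dqn import dqn_agent
from tf_agents.environments import suite_gym, tf_py_environment
from tf_agents.networks import q_network

# Small CartPole setup, just to build an agent and inspect its two policies.
env = tf_py_environment.TFPyEnvironment(suite_gym.load('CartPole-v0'))
q_net = q_network.QNetwork(env.observation_spec(), env.action_spec(),
                           fc_layer_params=(100,))
agent = dqn_agent.DqnAgent(
    env.time_step_spec(), env.action_spec(),
    q_network=q_net,
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    epsilon_greedy=0.1)  # exploration rate used only by collect_policy
agent.initialize()

print(type(agent.policy).__name__)          # greedy target policy (e.g. GreedyPolicy)
print(type(agent.collect_policy).__name__)  # exploratory behaviour policy (e.g. EpsilonGreedyPolicy)
```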

Why should this be the case? Because while learning and interacting with the environment, you need to explore it (i.e. take some random actions), whereas once you have learned the near-optimal policy, you may not need to explore anymore and can just take the near-optimal action (I say near-optimal rather than optimal because you may not have learned the optimal one).
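In practice, with TF-Agents this usually means driving data collection with `agent.collect_policy` and evaluating with `agent.policy`, roughly like this (a sketch reusing the setup above; the `evaluate` helper is mine, not part of the library):

```python
# Greedy evaluation: agent.policy takes no random exploration actions.
def evaluate(environment, policy, num_episodes=10):
    total_return = 0.0
    for _ in range(num_episodes):
        time_step = environment.reset()
        while not time_step.is_last():
            action_step = policy.action(time_step)
            time_step = environment.step(action_step.action)
            total_return += float(time_step.reward.numpy()[0])
    return total_return / num_episodes

# During training you would collect experience with agent.collect_policy
# (e.g. via a driver or a per-step collection function), and periodically
# measure progress with the greedy policy:
#   avg_return = evaluate(eval_env, agent.policy)
```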

Upvotes: 3
