Reputation: 83
I have trouble understanding how to get the data needed to compute the advantage in actor-critic settings.
I know that A(s,a) = Q(s,a) - V(s). It seems straightforward to get the state value estimate V(s), but how can we estimate Q(s,a) given that the policy only outputs probabilities?
Thanks!
Upvotes: 1
Views: 2221
Reputation: 508
You should estimate Q(s,a) using the critic, not the actor.
Remember that in the actor-critic setting (e.g., A2C), the actor(s) will output the probability distribution over all your actions at state s. From this distribution, you'll sample an action a to take in the environment. Then, the environment will give you a reward r and the next state s'.
After N steps, you'll use the critic to estimate the state value V(s) and calculate the advantage, which indicates, for example, how much better your actions were than average. With the advantage, you'll update your policy (actor) to increase or decrease the probability of taking action a at state s.
Therefore, to use your advantage function in this framework, you could use the critic to estimate Q(s,a), which is the value of each state-action pair. Then, you can estimate V(s) with:

V(s) = Σ_a π(a|s) Q(s,a)
You can take a look at this answer and at this post to get a better idea. Note that, to estimate Q(s,a), your critic network should have |A| output units, instead of just one as in the case of V(s). There are also other options to try as your advantage function.
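As a sketch of this setup (layer sizes, variable names, and the example numbers are assumptions of mine, not part of the answer): a critic with |A| output units returns Q(s,a) for every action, from which V(s) = Σ_a π(a|s) Q(s,a) and the advantage A(s,a) = Q(s,a) - V(s) follow.

```python
import torch
import torch.nn as nn

class QCritic(nn.Module):
    """Critic with one output unit per action, i.e. it estimates Q(s, .)."""
    def __init__(self, state_dim, num_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64),
            nn.ReLU(),
            nn.Linear(64, num_actions),  # |A| output units
        )

    def forward(self, state):
        return self.net(state)  # shape: (batch, |A|)

state_dim, num_actions = 4, 2
critic = QCritic(state_dim, num_actions)

state = torch.randn(1, state_dim)
action = torch.tensor([0])                  # action sampled from the actor
policy_probs = torch.tensor([[0.7, 0.3]])   # pi(.|s) from the actor

q_values = critic(state)                                            # Q(s, .)
v = (policy_probs * q_values).sum(dim=1)                            # V(s) = sum_a pi(a|s) Q(s, a)
advantage = q_values.gather(1, action.unsqueeze(1)).squeeze(1) - v  # A(s, a) = Q(s, a) - V(s)
print(advantage)
```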
Remember that the only purpose of the advantage function is to tell your model how much to increase or decrease the probability of taking action a at state s. If the action is better than average, you increase its probability; otherwise, you decrease it.
This paper is a very good reference.
Upvotes: 0
Reputation: 77847
The Q function depends on the availability of reward values for each future state. Q(s, a) is the value of taking action a and evaluating the resulting V(s') for the new state s'. Thus, the net advantage will be the sum across all actions a of P(a) * V(s'(a)), where s'(a) is the state reached by taking action a from state s.
Remember, this is only a value estimate; that's where the training iterations prove their worth. You keep iterating until the values converge to a stable Markov model.
Upvotes: 1