Bromine

Reputation: 21

How to find the true Q-value and overestimation bias in actor-critic methods

I am trying to plot the overestimation bias of the critics in DDPG and TD3 models. Essentially, there is a critic_target network and a critic network. I want to understand how one goes about measuring the overestimation bias of the critic relative to the true Q value, and also how to compute the true Q value in the first place.

I see in the original TD3 paper (https://arxiv.org/pdf/1802.09477.pdf) that the authors measure the overestimation bias of the value networks. Can someone guide me in plotting the same during the training phase of my actor-critic model?

Upvotes: 1

Views: 369

Answers (1)

Bromine

Reputation: 21

Answering my own question: during the training phase, at each evaluation period (for example, every 5000 steps), we can call a function that does the following. Keep in mind the policy is kept fixed throughout this rollout.

The pseudocode is as follows:

import gym

def get_estimation_values(policy, env_name, gamma=0.99):
    eval_env = gym.make(env_name)
    state, done = eval_env.reset(), False
    episode_reward = 0
    max_steps = eval_env._max_episode_steps  # episode length limit of the gym environment

    # For example, if there is only one critic (as in DDPG):
    action = policy.actor(state)
    estimated_Q = policy.critic(state, action)  # the estimated Q value for the starting state s0

    # The true Q value is given by:
    # Q(s0, a0) = r_0 + gamma * Q(s1, a1)
    # Q(s1, a1) = r_1 + gamma * Q(s2, a2)
    # Q(s2, a2) = r_2 + gamma * Q(s3, a3) and so on
    #
    # Therefore the true Q value can be written as the discounted return:
    # true_Q = r_0 + gamma * (r_1 + gamma * (r_2 + gamma * (r_3 + ...)))
    #        = r_0 + gamma * r_1 + gamma^2 * r_2 + gamma^3 * r_3 + ... until the terminal state

    # Roll out the (fixed) policy and accumulate the discounted return
    true_Q = 0
    for t in range(max_steps):
        if done:
            break

        # take an action according to the current policy until the episode ends
        action = policy.actor(state)  # convert the tensor to numpy if required
        next_state, reward, done, _ = eval_env.step(action)
        episode_reward += reward

        true_Q += (gamma ** t) * reward
        state = next_state

    return estimated_Q, true_Q
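
The overestimation bias is then just the difference between the two returned values. Below is a minimal usage sketch, assuming the get_estimation_values function above and a policy object exposing .actor and .critic; the helper name measure_bias and the environment name are only illustrative. Averaging over several evaluation episodes reduces the noise in the estimate, similar in spirit to the measurements in the TD3 paper.

import numpy as np

def measure_bias(policy, env_name, n_episodes=10, gamma=0.99):
    # average (estimated_Q - true_Q) over several evaluation rollouts
    biases = []
    for _ in range(n_episodes):
        estimated_Q, true_Q = get_estimation_values(policy, env_name, gamma)
        biases.append(float(estimated_Q) - true_Q)  # float() in case the critic returns a tensor
    return np.mean(biases)

# During training, e.g. every 5000 steps:
# bias_history.append(measure_bias(policy, "HalfCheetah-v2"))
# and afterwards plot bias_history against the evaluation steps.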

Upvotes: 1
