Reputation: 1653
I am developing a reinforcement learning agent. My reward structure looks like this:
thermal_coefficient = -0.1
zone_temperature = output[6]
if zone_temperature < self.temp_sp_min:
    temp_penalty = self.temp_sp_min - zone_temperature
elif zone_temperature > self.temp_sp_max:
    temp_penalty = zone_temperature - self.temp_sp_max
else:
    temp_penalty = 0
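For anyone reproducing this, here is a self-contained sketch of that penalty logic. It assumes the final reward is `thermal_coefficient * temp_penalty` (the snippet above defines the coefficient but doesn't show the multiplication), and the setpoints are passed as arguments rather than read from `self`:

```python
def temp_reward(zone_temperature, temp_sp_min=23.7, temp_sp_max=24.5,
                thermal_coefficient=-0.1):
    # Penalty grows linearly with the distance outside the comfort band;
    # it is zero anywhere inside [temp_sp_min, temp_sp_max].
    if zone_temperature < temp_sp_min:
        temp_penalty = temp_sp_min - zone_temperature
    elif zone_temperature > temp_sp_max:
        temp_penalty = zone_temperature - temp_sp_max
    else:
        temp_penalty = 0.0
    return thermal_coefficient * temp_penalty
```

Note the reward is 0 inside the band and becomes more negative the further the temperature drifts outside it, so the agent is rewarded most for staying within the setpoints.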
My temp_sp_min is 23.7 and my temp_sp_max is 24.5. When I train the agent with an epsilon-greedy action-selection strategy, the rewards converge after around 10,000 episodes. But when I test the trained agent, the actions it takes don't make sense: when zone_temperature is below temp_sp_min, it takes an action that reduces zone_temperature even further.
I don't understand where I am going wrong. Can someone help me with this?
Thanks
Upvotes: 0
Views: 444
Reputation: 109
It's normal for an epsilon-greedy algorithm to take actions that look illogical; those are exploration actions, taken with probability epsilon (the greedy action is taken with probability 1 - epsilon).
But I think your problem calls for a contextual multi-armed bandit (MAB) algorithm, because your reward depends on a context/state (the current temperature). Try algorithms that perform better in such conditions, such as LinUCB or DQN.
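To make the probabilities concrete, here is a generic epsilon-greedy sketch (not the asker's code): exploration fires with probability epsilon, and at evaluation time epsilon is typically set to 0 so the agent acts purely greedily.

```python
import random

def epsilon_greedy(q_values, epsilon):
    """Pick an action index from a list of Q-values."""
    # With probability epsilon: explore with a uniform random action.
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    # With probability 1 - epsilon: exploit the highest-valued action.
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Training: epsilon > 0, e.g. epsilon_greedy(q, 0.1)
# Testing:  epsilon = 0, so only the greedy action is ever chosen.
```

If illogical actions persist at test time even with epsilon = 0, the problem is in the learned values themselves, not in exploration.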
Upvotes: 0