chink

Reputation: 1653

Reward is converging but actions are not correct in reinforcement learning

I am developing a reinforcement learning agent.

My reward structure looks like:

    thermal_coefficient = -0.1

    zone_temperature = output[6]

    # penalize deviation outside the comfort band [temp_sp_min, temp_sp_max]
    if zone_temperature < self.temp_sp_min:
        temp_penalty = self.temp_sp_min - zone_temperature
    elif zone_temperature > self.temp_sp_max:
        temp_penalty = zone_temperature - self.temp_sp_max
    else:
        temp_penalty = 0

My temp_sp_min is 23.7 and temp_sp_max is 24.5. When I train the agent with an epsilon-greedy action-selection strategy, the rewards converge after around 10,000 episodes. But when I test the trained agent, the actions it takes don't make sense: when zone_temperature is below temp_sp_min, it takes an action that reduces zone_temperature even further.

I don't understand where I am going wrong. Can someone help me with this?

Thanks

Upvotes: 0

Views: 444

Answers (1)

Jad

Reputation: 109

It's normal for an epsilon-greedy algorithm to take actions that are not logical; those actions are exploration, taken with probability epsilon (the greedy action is taken with probability 1 - epsilon).
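To rule that out, make sure exploration is switched off when you evaluate the trained agent. Here is a minimal sketch of epsilon-greedy action selection (the q_values array is made up, since your selection code isn't shown); at test time you would call it with epsilon=0 so only the learned greedy policy is evaluated:

    import numpy as np

    rng = np.random.default_rng(0)

    def select_action(q_values, epsilon):
        """Epsilon-greedy: explore with probability epsilon, otherwise act greedily."""
        if rng.random() < epsilon:
            return int(rng.integers(len(q_values)))  # exploration: uniform random action
        return int(np.argmax(q_values))              # exploitation: highest estimated value

    q_values = np.array([0.2, 1.5, -0.3])            # hypothetical value estimates for 3 actions

    action = select_action(q_values, epsilon=0.1)    # training: some actions are random
    action = select_action(q_values, epsilon=0.0)    # testing: pure greedy policy

If the agent still takes illogical actions with epsilon=0, the problem is in the learned value estimates rather than in exploration.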

But I think your problem calls for a contextual algorithm, because your reward depends on a context/state (the current temperature). Try algorithms that handle this, such as LinUCB (a contextual multi-armed bandit method) or DQN.
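As a rough illustration, here is a minimal sketch of disjoint LinUCB on a made-up temperature context; the context features, the alpha value, and the assumption that the reward is thermal_coefficient * temp_penalty are all guesses, not tuned for your problem:

    import numpy as np

    class LinUCB:
        """Disjoint LinUCB: one linear model per arm, UCB-based exploration."""
        def __init__(self, n_arms, dim, alpha=1.0):
            self.alpha = alpha
            self.A = [np.eye(dim) for _ in range(n_arms)]    # per-arm design matrices
            self.b = [np.zeros(dim) for _ in range(n_arms)]  # per-arm reward vectors

        def choose(self, x):
            scores = []
            for A, b in zip(self.A, self.b):
                A_inv = np.linalg.inv(A)
                theta = A_inv @ b                            # ridge-regression estimate
                scores.append(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))
            return int(np.argmax(scores))                    # arm with highest upper bound

        def update(self, arm, x, reward):
            self.A[arm] += np.outer(x, x)
            self.b[arm] += reward * x

    # Hypothetical usage: context = [bias, zone_temperature], arms = {cool, hold, heat}
    bandit = LinUCB(n_arms=3, dim=2)
    x = np.array([1.0, 23.2])
    arm = bandit.choose(x)
    # assumed reward: thermal_coefficient * penalty outside [23.7, 24.5]
    reward = -0.1 * max(23.7 - x[1], x[1] - 24.5, 0.0)
    bandit.update(arm, x, reward)

The key point is that the chosen action depends on the observed temperature, so the agent can learn different behaviour below and above the setpoint band.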

Upvotes: 0
