Reputation: 1653
I am developing a reinforcement learning agent. My reward structure looks like this:
thermal_coefficient = -0.1
zone_temperature = output[6]
if zone_temperature < self.temp_sp_min:
    temp_penalty = self.temp_sp_min - zone_temperature
elif zone_temperature > self.temp_sp_max:
    temp_penalty = zone_temperature - self.temp_sp_max
else:
    temp_penalty = 0
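For anyone reproducing this, here is a self-contained sketch of that penalty logic. It assumes the final reward is `thermal_coefficient * temp_penalty` (the snippet above defines the coefficient but doesn't show the multiplication), and the setpoints are passed as arguments rather than read from `self`:

```python
def temp_reward(zone_temperature, temp_sp_min=23.7, temp_sp_max=24.5,
                thermal_coefficient=-0.1):
    # Penalty grows linearly with the distance outside the comfort band;
    # it is zero anywhere inside [temp_sp_min, temp_sp_max].
    if zone_temperature < temp_sp_min:
        temp_penalty = temp_sp_min - zone_temperature
    elif zone_temperature > temp_sp_max:
        temp_penalty = zone_temperature - temp_sp_max
    else:
        temp_penalty = 0.0
    return thermal_coefficient * temp_penalty
```

Note the reward is 0 inside the band and becomes more negative the further the temperature drifts outside it, so the agent is rewarded most for staying within the setpoints.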
My temp_sp_min is 23.7 and my temp_sp_max is 24.5. When I train the agent with an epsilon-greedy action-selection strategy, the rewards converge after around 10,000 episodes. But when I test the trained agent, the actions it takes don't make sense: when zone_temperature is below temp_sp_min, it takes an action that reduces zone_temperature even further.
I don't understand where I am going wrong. Can someone help me with this?
Thanks
Upvotes: 0
Views: 444
Reputation: 109
It's normal for an epsilon-greedy algorithm to take actions that look illogical; those are exploration actions, taken with probability epsilon (the greedy action is taken with probability 1 - epsilon).
But I think your problem calls for a contextual multi-armed bandit (MAB) algorithm, because your reward depends on a context/state (the current temperature). Try algorithms that perform better in such conditions, such as LinUCB or DQN.
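To make the probabilities concrete, here is a generic epsilon-greedy sketch (not the asker's code): exploration fires with probability epsilon, and at evaluation time epsilon is typically set to 0 so the agent acts purely greedily.

```python
import random

def epsilon_greedy(q_values, epsilon):
    """Pick an action index from a list of Q-values."""
    # With probability epsilon: explore with a uniform random action.
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    # With probability 1 - epsilon: exploit the highest-valued action.
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Training: epsilon > 0, e.g. epsilon_greedy(q, 0.1)
# Testing:  epsilon = 0, so only the greedy action is ever chosen.
```

If illogical actions persist at test time even with epsilon = 0, the problem is in the learned values themselves, not in exploration.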
Upvotes: 0