Reputation: 95
I am using Stable-Baselines3 (SB3) with OpenAI Gym. In a toy version of the problem, the agent tries to learn a given (fixed) target point (x and y coordinates within [0,31] and [0,25] respectively) on a screen.

My action space would thus be a Box (Version A):

self.action_space = gym.spaces.Box(np.array([0, 0]), np.array([31, 25]))

The reward obtained by the agent is minus the Manhattan distance between the chosen point and the target (the episode terminates immediately). But when running the PPO algorithm, the agent only ever tries coordinates inside the box [0,0] to [2,2] (i.e. coordinates never exceed 2); nothing outside this box seems ever to be explored. The learned policy is not even the best point within that box (typically (2,2)) but a random point inside it.
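For context, a minimal sketch of what my setup looks like (the class name, dummy observation, and target value are placeholders, not my actual code):

import gym
import numpy as np

class TargetEnv(gym.Env):
    """Toy environment: pick a point, get minus the Manhattan distance as reward."""

    def __init__(self, target=(20.0, 10.0)):  # placeholder target
        super().__init__()
        self.target = np.array(target, dtype=np.float32)
        # Version A: actions are raw screen coordinates
        self.action_space = gym.spaces.Box(np.array([0, 0]), np.array([31, 25]))
        # dummy observation, since the episode ends after one step
        self.observation_space = gym.spaces.Box(0, 1, shape=(1,), dtype=np.float32)

    def reset(self):
        return np.zeros(1, dtype=np.float32)

    def step(self, action):
        # reward = minus the Manhattan distance to the target; terminate immediately
        reward = -float(np.abs(np.asarray(action) - self.target).sum())
        return np.zeros(1, dtype=np.float32), reward, True, {}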
When I normalize both axes to [0,1] (Version B):

self.action_space = gym.spaces.Box(np.array([0, 0]), np.array([1, 1]))

and rescale the actual coordinates accordingly (the x-action is multiplied by 31, the y-action by 25), the agent does explore the whole box (I tried PPO and A2C). However, the resulting policy often corresponds to a corner (the corner closest to the target), even though better rewards were obtained at some point during training. Only occasionally is one of the coordinates not at a boundary, and never both at once.
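In Version B only the action space and the rescaling change; inside step() the rescaling looks roughly like this (same placeholder environment as above):

def step(self, action):
    # the policy outputs actions in [0, 1]; map them back to screen coordinates
    x = float(action[0]) * 31.0
    y = float(action[1]) * 25.0
    reward = -(abs(x - self.target[0]) + abs(y - self.target[1]))
    return np.zeros(1, dtype=np.float32), reward, True, {}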
If I discretize the problem:

self.action_space = gym.spaces.MultiDiscrete([2, 32, 26])

the agent correctly learns the best possible (x, y) action (nothing in the code from Version A changes except the action space). Obviously I'd prefer not to discretize.
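For completeness, the discretized version decodes its action roughly like this (reading the first component of [2, 32, 26] as the decide flag is my own layout choice, not something imposed by gym):

def step(self, action):
    # MultiDiscrete([2, 32, 26]) yields an integer triple; here I read it as
    # (decide, x, y), with x in 0..31 and y in 0..25
    decide, x, y = int(action[0]), int(action[1]), int(action[2])
    reward = -(abs(x - self.target[0]) + abs(y - self.target[1]))
    return np.zeros(1, dtype=np.float32), reward, True, {}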
What are possible reasons for this overall behavior (not exploring, settling only or mostly on corners, moving away from better rewards)? The rest of the code is too unwieldy to paste here, but it does not change between these scenarios except for the action space, so the fact that the discretized version works does not fit with a bug in the reward calculation.
Finally, my action space needs a discrete component (whether the agent has found the target or will keep looking) on top of the two continuous components (x and y). The reward for a non-decisive fixation would be a small penalty; the reward for the final decision would be as above (the closer to the actual target, the better).

self.action_space = gym.spaces.Tuple((gym.spaces.Discrete(2), gym.spaces.Box(np.array([0, 0]), np.array([31, 25]), dtype=np.float32)))

should be what I'm looking for, but Tuple spaces are not supported by SB3. Is there any workaround? What do people do when they need both continuous and discrete components? I thought of making the binary component a float and mapping it to 0/1 below/above a certain cutoff, but that may not lend itself well to learning (a sketch of that idea is below).
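The cutoff idea would look roughly like this (the 0.5 cutoff and the decode_action helper are purely illustrative):

import gym
import numpy as np

# fold the binary "decide" flag into one extra Box dimension
action_space = gym.spaces.Box(np.array([0, 0, 0], dtype=np.float32),
                              np.array([1, 1, 1], dtype=np.float32))

def decode_action(action, cutoff=0.5):
    decide = int(action[0] > cutoff)   # 1 = report the target, 0 = keep looking
    x = float(action[1]) * 31.0        # rescale normalized coordinates
    y = float(action[2]) * 25.0
    return decide, x, y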
Upvotes: 2
Views: 653
Reputation: 95
For posterity: stable_baselines seems to sample actions in mysterious ways. If the action space is defined as [0,1] or [-1,1], stable_baselines will indeed sample that space. But if the action space is, as in my case, [0,31], then the sampled actions lie roughly within [0,3] or [0,4], with most values within [0,1].

So the workaround seems to be to define the action space as a Box over [0,1] or [-1,1], and rescale the action returned by whatever SB3 algorithm you're using; see the sketch below.
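A sketch of that rescaling, assuming the action space is a Box over [-1, 1] (the helper name and bounds arrays are just for illustration):

import numpy as np

LOW = np.array([0.0, 0.0])
HIGH = np.array([31.0, 25.0])

def rescale(action):
    # map an action in [-1, 1]^2 linearly onto [LOW, HIGH]
    return LOW + (np.asarray(action) + 1.0) * 0.5 * (HIGH - LOW)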
Upvotes: 4