Reputation: 11
A parametric/variable-length action model is provided in the RLlib examples. The example assumes the outputs are logits for a single Categorical action dist. How can I get this to work with a more complex output?
For example, there are 200 different balls in a box. At every step, 2 balls are picked and put back. The action space can be defined as MultiDiscrete([200, 200]) or Tuple((spaces.Discrete(200), spaces.Discrete(200))).
There are 3 restrictions that make some actions invalid.
How can I enforce these 3 constraints via action masking in RLlib?
Assume that the obs space has 2 parts. The first constraint is implicit: the invalid actions can be determined without the observation. For the second constraint, a real_obs marks each ball with a number indicating its color; balls with the same number are not allowed to be picked together. For the third constraint, an action_mask indicates which balls are allowed to be picked.
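To make the three constraints concrete, here is a minimal NumPy sketch of how they could be turned into a mask over the second ball once the first ball is known. It is only an illustration under assumptions: the dict keys follow the names used above, the number of colors (10) and the helper name `second_pick_mask` are hypothetical, and the implicit first constraint is assumed to be "the same ball cannot be picked twice".

```python
import numpy as np

N_BALLS = 200  # number of balls, as in the question


def second_pick_mask(first, real_obs, action_mask):
    """Return a 0/1 float mask over the second ball, given the first pick."""
    # Constraint 3: the ball must be flagged as pickable.
    mask = action_mask.astype(bool).copy()
    # Constraint 2: the second ball must have a different color number.
    mask &= real_obs != real_obs[first]
    # Constraint 1 (assumed): the same ball cannot be picked twice.
    # Already implied by constraint 2, kept explicit for readability.
    mask[first] = False
    return mask.astype(np.float32)


# Hypothetical observation: 10 colors, random pickable flags.
rng = np.random.default_rng(0)
obs = {
    "real_obs": rng.integers(0, 10, size=N_BALLS),    # color number per ball
    "action_mask": rng.integers(0, 2, size=N_BALLS),  # 1 = ball may be picked
}

first = int(np.flatnonzero(obs["action_mask"])[0])  # any pickable first ball
mask2 = second_pick_mask(first, obs["real_obs"], obs["action_mask"])
```

Because the mask for the second ball depends on the first pick, a single static mask over MultiDiscrete([200, 200]) cannot express these constraints on its own.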
Specifically, how to implement the action/observation space and the forward function in the custom model?
If my assumed obs space is infeasible, feel free to define your own obs space and the corresponding custom model.
ParametricActionsModel example in rllib
Upvotes: 1
Views: 1246
Reputation: 1050
I had exactly the same problem. The big problem is the dependency between your two actions (e.g. you can't take the same ball twice). One thing you can do is combine them into a single action, so you have one big action space of 200x200=40000 actions. You can then create the full mask in the env and pass it to the forward function for the masking. Otherwise you need to work with dependent action sampling and distributions.
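As a rough sketch of the flattened variant (not RLlib code, just NumPy, with hypothetical helper names `joint_mask`, `mask_logits` and `decode`), the env could build one mask over all 40000 pairs and the model could add the log of that mask to its logits. The example assumes the same constraints as in the question, with 10 colors chosen arbitrarily, and that the implicit constraint is "not the same ball twice".

```python
import numpy as np

N_BALLS = 200


def joint_mask(real_obs, action_mask):
    """Flat 0/1 mask of length N_BALLS * N_BALLS over (first, second) pairs."""
    allowed = action_mask.astype(bool)
    valid = allowed[:, None] & allowed[None, :]       # both balls pickable
    valid &= real_obs[:, None] != real_obs[None, :]   # different color numbers
    np.fill_diagonal(valid, False)                    # assumed: no ball twice
    return valid.reshape(-1).astype(np.float32)


def mask_logits(logits, mask):
    """Push the logits of invalid actions to a very large negative value."""
    # log(1) = 0 leaves valid logits unchanged; log(0) = -inf is clamped
    # to the smallest finite float32 so no NaNs appear downstream.
    with np.errstate(divide="ignore"):
        inf_mask = np.maximum(np.log(mask), np.finfo(np.float32).min)
    return logits + inf_mask


def decode(flat_action):
    """Map a flat action index back to a (first, second) ball pair."""
    return divmod(int(flat_action), N_BALLS)


rng = np.random.default_rng(0)
real_obs = rng.integers(0, 10, size=N_BALLS)     # hypothetical colors
action_mask = rng.integers(0, 2, size=N_BALLS)   # 1 = ball may be picked
mask = joint_mask(real_obs, action_mask)

# Stand-in for the model output: one logit per flattened action.
logits = rng.normal(size=N_BALLS * N_BALLS).astype(np.float32)
first, second = decode(np.argmax(mask_logits(logits, mask)))
```

The action space would then be Discrete(40000), and the env decodes the chosen index with `decode` before applying it. The cost is that the mask and the logit layer both grow quadratically with the number of balls.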
For me, the multiplication was not an option, as the resulting action space would be too large. So I did it the following way:
Upvotes: 3