Gulololo

Reputation: 11

Complex Action Mask in rllib

A parametric/variable-length action model is provided in the RLlib examples. The example assumes the outputs are logits for a single Categorical action distribution. How can I get this to work with a more complex output?

For example, there are 200 different balls in a box. At every step, 2 balls are picked and put back. The action space can be defined as MultiDiscrete([200, 200]) or Tuple((spaces.Discrete(200), spaces.Discrete(200))).

There are 3 restrictions that make some actions invalid.

  1. The 2 balls must be different every time, so actions like (1,1) or (2,2) are invalid.
  2. Balls with the same color are not allowed to be picked together. For example, if the No.2 and No.3 balls are both yellow, they cannot be picked together in that state, so action (1,2) is invalid in that state.
  3. Some balls are not allowed to be picked in specific states. For example, when the No.2 ball is marked Not Allowed to Pick, all actions involving the No.2 ball, like (1, n) or (n, 1), are invalid.

How can these 3 constraints be enforced via action masking in RLlib?

Assume that the obs space has 2 parts. The first constraint is implicit: the invalid actions can be determined without the observation. For the second constraint, a real_obs marks each ball with a number indicating its color; balls with the same number are not allowed to be picked together. For the third constraint, an action_mask indicates which balls are allowed to be picked.
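To make the three constraints concrete, here is a minimal NumPy sketch of how they could be combined into one pairwise validity mask. The function name `pair_mask` and the argument names `colors` and `allowed` are hypothetical stand-ins for the real_obs and action_mask described above; nothing here is RLlib API, and it assumes constraint 2 is active in the current state.

```python
import numpy as np

def pair_mask(colors, allowed):
    """Return an [n, n] boolean matrix; entry (i, j) is True if picking
    ball i first and ball j second is a valid action."""
    colors = np.asarray(colors)
    allowed = np.asarray(allowed, dtype=bool)
    n = colors.shape[0]
    # Constraint 1: the two balls must be different (no diagonal entries).
    mask = ~np.eye(n, dtype=bool)
    # Constraint 2: balls with the same color number cannot be picked together.
    mask &= colors[:, None] != colors[None, :]
    # Constraint 3: both balls must individually be allowed in this state.
    mask &= allowed[:, None] & allowed[None, :]
    return mask

# Toy state with 4 balls: balls 1 and 2 share a color, ball 2 is not allowed.
toy_mask = pair_mask(colors=[0, 1, 1, 2], allowed=[True, True, False, True])
```

Row `i` of the result is the mask for the second pick given that ball `i` was picked first, which is the form a dependent (two-stage) sampling scheme would need.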

Specifically, how to implement the action/observation space and the forward function in the custom model?

If my assumed obs space is infeasible, feel free to define your own obs space and the corresponding custom model.

ParametricActionsModel example in rllib

Upvotes: 1

Views: 1246

Answers (1)

Michael Möbius

Reputation: 1050

I had exactly the same problem. The big problem is the dependency between your two actions (e.g. you can't take the same ball twice). One thing you can do is multiply them, so you have one big action space of 200x200=40000. Then you can create the full mask in the env and pass it to the forward function for the masking. Otherwise you need to work with dependent action sampling and distributions.
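As a rough illustration of the multiplied action space, the sketch below flattens a pair of picks into a single index and masks the logits of invalid pairs. It uses NumPy instead of TensorFlow, a toy size of 4 balls instead of 200, and the helper names `encode`, `decode` and `mask_logits` are hypothetical. Only constraint 1 is encoded in the toy mask.

```python
import numpy as np

def encode(a1, a2, n):
    """Map a pair (a1, a2) to one index in a Discrete(n * n) space."""
    return a1 * n + a2

def decode(flat, n):
    """Inverse of encode: recover (a1, a2) from the flat index."""
    return divmod(flat, n)

def mask_logits(logits, flat_mask):
    """Push logits of invalid actions to a very large negative value so
    that a softmax assigns them (numerically) zero probability."""
    return np.where(flat_mask, logits, np.finfo(np.float32).min)

n = 4                                   # toy size; the question uses 200
pair_valid = ~np.eye(n, dtype=bool)     # only "two different balls" here
flat_mask = pair_valid.reshape(-1)      # shape (n * n,), built in the env
logits = np.zeros(n * n, dtype=np.float32)   # stand-in for model output
masked = mask_logits(logits, flat_mask)
```

With 200 balls the flat mask has 40000 entries per observation, which is the size concern raised in the next paragraph.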

For me the multiplication was not an option as it would be too large. So I did it the following way:

  1. The env creates a mask for action 1 and XXX masks for the dependent action 2.
  2. In the model, you sample action 1 (with tf.random.categorical) using your action 1 mask.
  3. Depending on action 1, you select a mask for action 2 (tf.where) and sample action 2.
  4. The output of the model should be the logits and the sampled action.
  5. You need to implement your own MultiCategorical action distribution to use your already sampled actions.
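The sampling in steps 1 to 3 can be sketched roughly as follows. This is a NumPy stand-in, not the TensorFlow code from the answer: `rng.choice` plays the role of tf.random.categorical and row indexing plays the role of tf.where. The names `masked_sample` and `sample_pair` are hypothetical, and deriving the action 1 mask from the rows of the pairwise mask is an assumption, made so that action 1 never lands on a ball with no valid partner.

```python
import numpy as np

def masked_sample(logits, mask, rng):
    """Sample one index from softmax(logits), restricted to mask == True.
    Assumes at least one entry of mask is True."""
    masked = np.where(mask, np.asarray(logits, dtype=np.float64), -np.inf)
    weights = np.exp(masked - masked.max())   # invalid entries become 0
    return int(rng.choice(len(weights), p=weights / weights.sum()))

def sample_pair(logits1, logits2, pair_mask, rng):
    """Two-stage sampling: action 1 first, then action 2 given action 1."""
    # Assumed action 1 mask: balls that have at least one valid partner.
    mask1 = pair_mask.any(axis=1)
    a1 = masked_sample(logits1, mask1, rng)
    # Select the action 2 mask that belongs to the sampled action 1.
    a2 = masked_sample(logits2, pair_mask[a1], rng)
    return a1, a2

# Toy state: balls 1 and 2 share a color, ball 2 is not allowed.
colors = np.array([0, 1, 1, 2])
allowed = np.array([True, True, False, True])
pair = (colors[:, None] != colors[None, :]) & allowed[:, None] & allowed[None, :]

rng = np.random.default_rng(0)
samples = [sample_pair(np.zeros(4), np.zeros(4), pair, rng) for _ in range(100)]
```

For step 5, note that the log-probability of the sampled pair under this scheme is log p(a1) + log p(a2 | a1), with both terms computed from the masked logits, so a custom distribution would need to reproduce the same masking when it evaluates already sampled actions.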

Upvotes: 3
