Reputation: 727
I am just getting started self-studying reinforcement learning with Stable Baselines 3. My long-term goal is to train an agent to play a specific turn-based board game. Currently I am quite overwhelmed by all the new material, though.
I have implemented a gym-environment that I can use to play my game manually or by having it pick random actions.
Currently I am stuck trying to get a model to hand me actions in response to an observation. The action space of my environment is Discrete(256). I create the model with the environment as model = PPO('MlpPolicy', env, verbose=1). When I later call model.predict(observation) I do get back a number that looks like an action. When run repeatedly I get different numbers, which I assume is to be expected for an untrained model.
Unfortunately, in my game most of the actions are illegal in most states, and I would like to filter them out and pick the best legal one. Or simply dump the output for all actions to get insight into what is happening.
In browsing other people's code I have seen references to model.action_probability(observation). Unfortunately, that method is not part of Stable Baselines 3 as far as I can tell. The guide for migration from Stable Baselines 2 to v3 only mentions it not being implemented [1].
Can you give me a hint on how to go on?
Upvotes: 9
Views: 3045
Reputation: 2655
About this point:

when I later call model.predict(observation) I do get back a number that looks like an action.

You can prevent that behavior with the following line:

model.predict(observation, deterministic=True)

With deterministic=True, the predicted action is always the one with the maximum probability, instead of being sampled from the distribution.
Just to give you an example, suppose the model assigns 30 % probability to action A and 70 % to action B.

If you don't use deterministic=True, the model will sample from those probabilities to return a prediction, so either action can come back.

If you use deterministic=True, the model will always return action B.
Upvotes: 2
Reputation: 143
In case anyone comes across this post in the future, this is how you get the action probabilities for PPO.
from stable_baselines3.common.utils import obs_as_tensor

def predict_proba(model, state):
    # state must include a batch dimension, e.g. state = observation[None]
    obs = obs_as_tensor(state, model.policy.device)
    # Action distribution of the current policy for this observation
    dis = model.policy.get_distribution(obs)
    probs = dis.distribution.probs
    # Detach from the autograd graph and move to CPU before converting
    probs_np = probs.detach().cpu().numpy()
    return probs_np

Note that obs_as_tensor lives in stable_baselines3.common.utils, not stable_baselines3.common.policies.
Upvotes: 5