Reputation: 555
I was working with the stable_baselines3 library when I found something that I did not expect.
Here is a simple piece of code to reproduce the issue:
import gymnasium as gym
from stable_baselines3 import DQN
env = gym.make("CartPole-v1")
model = DQN("MlpPolicy", env, verbose=0, stats_window_size=100_000)
model.learn(total_timesteps=100_000)
Taking a look at the last episode reward:
print(model.ep_info_buffer[-1])
{'r': 409.0, 'l': 409, 't': 54.87983}
But if I evaluate the model with the following code:
obs, info = env.reset()
total_reward = 0
while True:
    action, _states = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward = total_reward + reward
    if terminated or truncated:
        obs, info = env.reset()
        break
print("total_reward {}".format(total_reward))
total_reward 196.0
I get a different reward, which I did not expect.
I expected to get the same 409 as in model.ep_info_buffer[-1].
Why the difference? Is ep_info_buffer something different from the reward per episode?
Upvotes: 2
Views: 64
Reputation: 1762
You are comparing the reward from the last episode of the training process with the reward from one episode of using the trained model. The resulting reward could be different because:
During training, the agent still explores, so some of its actions are chosen randomly rather than greedily (picture taken from "Reinforcement Learning: An Introduction" by Sutton and Barto). But after training you are using the trained model to generate episodes, and no exploration is done because you pass deterministic=True. Even if you didn't use deterministic=True, the model wouldn't necessarily select actions the same way it did during training (for example, it could have used ε-greedy during training and an ε-soft policy during prediction).
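Because of this, a single episode is a noisy measurement either way. A minimal sketch (assuming you still have the model and env from your question) is to average the return over several deterministic episodes with Stable Baselines3's evaluate_policy helper:

from stable_baselines3.common.evaluation import evaluate_policy

# Average the undiscounted episode return over several evaluation episodes
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=20, deterministic=True)
print("mean_reward {:.1f} +/- {:.1f}".format(mean_reward, std_reward))

Comparing that mean against the training-time episode rewards is more meaningful than comparing 409 against 196 from single episodes.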
The state to which the environment gets reset at the end of each episode during training might not match the one you get when you reset it later with env.reset(). The Stable Baselines API doesn't seem to guarantee anything about this, so you shouldn't assume anything. In the case of this environment, there seems to be some randomness coming from here.
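As a rough illustration (the seed value is arbitrary), each env.reset() in Gymnasium samples a slightly different initial state unless you pin the seed:

obs, info = env.reset()
print(obs)             # a small random initial state
obs, info = env.reset()
print(obs)             # a different small random initial state

obs, info = env.reset(seed=0)
print(obs)             # reproducible initial state for this seed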
In some cases, the environment itself is stochastic, meaning that when you apply an action, the resulting state can be nondeterministic. This doesn't seem to be the case with the CartPole-v1 environment, but more generally you can't assume that all environments behave the same way.
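If you want to check this for a given environment, a small hand-rolled test (rollout is a hypothetical helper, not part of Gymnasium) is to replay the same action sequence from the same seed and compare the trajectories:

import numpy as np

def rollout(env, seed, actions):
    # Reset with a fixed seed and replay a fixed action sequence
    obs, _ = env.reset(seed=seed)
    states = [obs]
    for a in actions:
        obs, reward, terminated, truncated, _ = env.step(a)
        states.append(obs)
        if terminated or truncated:
            break
    return np.array(states)

t1 = rollout(env, 0, [0, 1, 0, 1, 1])
t2 = rollout(env, 0, [0, 1, 0, 1, 1])
print(np.allclose(t1, t2))  # True for CartPole-v1: its transitions are deterministic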
Upvotes: 1