Alfonso_MA

Reputation: 555

stable_baselines3: why does the reward not match when comparing ep_info_buffer vs evaluation?

I was working with the stable_baselines3 library when I found something that I did not expect.

Here is a simple piece of code to reproduce the issue:

import gymnasium as gym
from stable_baselines3 import DQN

env = gym.make("CartPole-v1")

model = DQN("MlpPolicy", env, verbose=0, stats_window_size=100_000)
model.learn(total_timesteps=100_000)

Taking a look at the last episode reward:

print(model.ep_info_buffer[-1])

{'r': 409.0, 'l': 409, 't': 54.87983}

But if I evaluate the model with the following code:

obs, info = env.reset()
total_reward = 0
while True:
    action, _states = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward = total_reward + reward
    if terminated or truncated:
        obs, info = env.reset()
        break

print("total_reward {}".format(total_reward))

total_reward 196.0

I get a different reward, which I did not expect.

I expected to get the same 409 as in model.ep_info_buffer[-1].

Why the difference? Is ep_info_buffer something different from the reward per episode?

Upvotes: 2

Views: 64

Answers (1)

Sachin Hosmani

Reputation: 1762

You are comparing the reward from the last episode of the training process with the reward from a single episode generated by the trained model. These rewards can differ for a few reasons:

  1. During training, DQN does some exploration based on the evolving state-action value function that it is learning (represented by an MLP in your example).

[figure omitted: illustration of exploration, taken from "Reinforcement Learning: An Introduction" by Sutton and Barto]

But after training, you are using the trained model to generate episodes in which no exploration is done, because you pass deterministic=True. Even if you did not use deterministic=True, actions would not necessarily be selected the same way as during training (for example, the agent could have used ε-greedy during training and an ε-soft policy during prediction). The first sketch after this list illustrates the effect.

  2. The state to which the environment gets reset at the end of each episode during training might not match the one you get when you reset it later with env.reset(). The Stable Baselines API does not seem to guarantee anything about this, so you should not assume anything. In the case of this environment, the initial state itself is drawn randomly on every reset (see the seeding sketch after this list).

  3. In some cases, the environment itself is stochastic, meaning that when you apply an action, the resulting state can be nondeterministic. This does not seem to be the case with CartPole-v1, but in general you cannot assume that every environment is deterministic. Averaging over several evaluation episodes, as in the last sketch below, gives a less noisy estimate either way.
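
To make point 1 concrete, here is a minimal sketch, assuming the model and env trained in the question and DQN's default exploration hyperparameters: after learn(), the epsilon-greedy schedule has annealed down to a small but non-zero exploration rate, and predict(..., deterministic=False) still takes random actions with that probability, while deterministic=True always takes the greedy action.

# Final value of the epsilon-greedy schedule
# (exploration_final_eps, 0.05 by default, so still non-zero).
print("final exploration rate:", model.exploration_rate)

obs, _ = env.reset(seed=0)
greedy_action, _ = model.predict(obs, deterministic=True)

# With deterministic=False, DQN's predict() still takes a random action with
# probability exploration_rate, so it can disagree with the greedy choice.
n_diff = sum(
    model.predict(obs, deterministic=False)[0] != greedy_action
    for _ in range(1000)
)
print(f"disagreed with the greedy action in {n_diff} of 1000 calls")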
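
For point 2, the initial state of CartPole-v1 is itself drawn randomly on every reset, so a training episode and your evaluation episode generally do not even start from the same observation. A small Gymnasium sketch (demo_env is just a throwaway env for this check):

import gymnasium as gym

demo_env = gym.make("CartPole-v1")

# Each unseeded reset draws a slightly different initial state.
print(demo_env.reset()[0])
print(demo_env.reset()[0])

# Passing a seed to reset() makes the initial observation reproducible.
print(demo_env.reset(seed=42)[0])
print(demo_env.reset(seed=42)[0])  # same observation as the previous line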
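
Finally, because of all of these sources of randomness, a single episode is a noisy estimate of the policy's performance. A common pattern is to average the return over several evaluation episodes, for example with SB3's evaluate_policy helper (a sketch, assuming the model from the question):

import gymnasium as gym
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.monitor import Monitor

# Separate, Monitor-wrapped env so episode returns are recorded cleanly.
eval_env = Monitor(gym.make("CartPole-v1"))

mean_reward, std_reward = evaluate_policy(
    model, eval_env, n_eval_episodes=20, deterministic=True
)
print(f"mean episode reward: {mean_reward:.1f} +/- {std_reward:.1f}")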

Upvotes: 1
