Reputation: 1
I hope somebody can help me. I'm implementing a basic Vanilla Policy Gradient algorithm for the CartPole-v1 gymnasium environment, and I don't know what I'm doing wrong. No matter what I try, during the training loop the loss decreases (so the model is actually learning something), but the episode total reward also decreases until it reaches around 9-10 steps (which I imagine is the minimum number of steps needed to make the pole fall). So it's learning to perform badly!
I don't know if it's something to do with the signs, the way I compute the loss, the optimizer... I have no idea.
For the discounted rewards (summed from step t to the final step T of episode k) I'm using
$ Q_{k,t} = \sum_{i=t}^{T} \gamma^{i-t} r_i $
And for the loss:
$ L = -\sum_{k,t} Q_{k,t} \log \pi_{\theta}(a_t \mid s_t) $
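As a small worked example of the discounted return above, here is a plain-Python sketch (no PyTorch) that computes the return for every step of one short episode, both with a backward running sum and directly from the formula. The reward list and the value GAMMA = 0.5 are made up for easy arithmetic and are not taken from the training code.

```python
GAMMA = 0.5  # illustrative value only; the question uses 0.99


def discounted_returns(rewards, gamma=GAMMA):
    """Return Q_t = sum over i >= t of gamma^(i-t) * r_i, for every step t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the return of the step after it.
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def direct_return(rewards, t, gamma=GAMMA):
    """Same quantity computed straight from the sum, for comparison."""
    return sum(gamma ** (i - t) * rewards[i] for i in range(t, len(rewards)))


# Hypothetical episode: reward 1 per step, 0 on the final step.
rewards = [1.0, 1.0, 1.0, 0.0]
returns = discounted_returns(rewards)
print(returns)  # [1.75, 1.5, 1.0, 0.0]
```

Note that with these rewards every return is non-negative, so each sampled action's log-probability is weighted in the same direction and only the magnitude differs.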
The code is a mix from Maxim Lapan's Deep RL Hands-On book, Karpathy's Pong example (blog, code), and personal tweaks.
Here's my code:
import gymnasium as gym
import torch
from torch import nn
import torch.nn.functional as F
from torch.nn.init import xavier_uniform_
import numpy as np
GAMMA = 0.99
LEARNING_RATE = 0.001
BATCH_SIZE = 4
DEVICE = torch.device('mps')
class XavierLinear(nn.Linear):
    def __init__(self, in_features: int, out_features: int, bias: bool = True, device=None, dtype=None) -> None:
        super().__init__(in_features, out_features, bias, device, dtype)
        xavier_uniform_(self.weight)
class VPG(nn.Module):
    def __init__(self, input_size, output_size):
        super(VPG, self).__init__()
        self.net = nn.Sequential(
            XavierLinear(input_size, 128),
            nn.ReLU(),
            XavierLinear(128, output_size),
        )

    def forward(self, x):
        return F.softmax(self.net(x), dim=0)
def run_episode(model, env):
    obs = env.reset()[0]
    obs = torch.Tensor(env.reset()[0]).to(DEVICE)
    te = tr = False
    rewards, outputs, actions = [], [], []
    while not (te or tr):
        probs = model(obs)
        action = probs.multinomial(1).item()
        obs, r, te, tr, _ = env.step(action)
        obs = torch.Tensor(obs).to(DEVICE)
        if (te or tr):
            r = 0
        rewards.append(r)
        outputs.append(probs)
        actions.append(action)
    return torch.Tensor(rewards).to(DEVICE), torch.concatenate(outputs).reshape(len(rewards), 2), actions
def discount_rewards(rewards):
    discounted_r = torch.zeros_like(rewards)
    additive_r = 0
    for idx in range(len(rewards)-1, -1, -1):
        to_add = GAMMA * additive_r
        additive_r = to_add + rewards[idx]
        discounted_r[idx] = additive_r
    return discounted_r.to(DEVICE)
def loss_function(discounted_r, probs, actions):
    logprobs = torch.log(probs)
    selected = logprobs[range(probs.shape[0]), actions]
    # discounted_r = (discounted_r - discounted_r.mean()) / discounted_r.std()
    weighted = selected * discounted_r
    return -weighted.sum()
# The actual training loop:
episode_total_reward = 0
batch_losses = torch.Tensor().to(DEVICE)
batch_actions = []
batch_disc_r = torch.Tensor().to(DEVICE)
batch_probs = torch.Tensor().to(DEVICE)
best_ep_reward = 0
losses, ep_total_lenghts = [], [0]
episodes = 0
TARGET_REWARD = 100
env = gym.make("CartPole-v1")
model = VPG(env.observation_space.shape[0], 2).to(DEVICE)
optim = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)
while np.array(ep_total_lenghts)[-100:].mean() < TARGET_REWARD:
    rewards, probs, actions = run_episode(model, env)
    discounted_r = discount_rewards(rewards)
    episode_total_reward = rewards.shape[0]
    ep_total_lenghts.append(episode_total_reward)
    episodes += 1
    batch_actions += actions
    batch_disc_r = torch.concatenate([batch_disc_r, discounted_r])
    batch_probs = torch.concatenate([batch_probs, probs])
    if episodes % BATCH_SIZE == 0:
        loss = loss_function(batch_disc_r, batch_probs, batch_actions)
        losses.append(loss.item())
        model.zero_grad()
        loss.backward()
        optim.step()
        batch_actions = []
        batch_disc_r = torch.Tensor().to(DEVICE)
        batch_probs = torch.Tensor().to(DEVICE)
        print(f"Episode {episodes}. Loss: {loss}. Reward: {episode_total_reward}")
print(f"Success in {episodes} episodes. Loss: {loss}. Reward: {episode_total_reward}")
Tried: changing signs in the loss function, changing rewards (non-terminal step = 0 and terminal step = -1), manually updating the weights (adding the gradient or subtracting it...). In each case I get the same result: the loss decreases but the agent doesn't learn to keep the pole up.
Expectation: Loss decreases and episode total reward (steps played) increases.
EDIT: I finally fixed the problem by applying these changes:
- In run_episode, I changed the reward to -1 if the episode is terminated and 0 otherwise: r = -1 if te else 0
- In loss_function, I apply the normalization: discounted_r = (discounted_r - discounted_r.mean()) / discounted_r.std()
- In loss_function, I return the negative of the mean instead of the negative of the sum: return -weighted.mean()
With those changes the problem is fixed. I still don't know why, before, the loss was decreasing while the agent performed worse and worse. It was kind of learning backwards :).
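To make the effect of the first two changes concrete, here is a plain-Python sketch (no PyTorch) that builds the discounted returns for a made-up 5-step episode with reward -1 on termination and 0 otherwise, then standardizes them. The episode and the helper names are hypothetical. It uses the sample standard deviation (dividing by n - 1), which matches the default behaviour of torch.Tensor.std.

```python
import statistics

GAMMA = 0.99


def discounted_returns(rewards, gamma=GAMMA):
    """Backward running sum: Q_t = r_t + gamma * Q_{t+1}."""
    returns, running = [0.0] * len(rewards), 0.0
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def standardize(values):
    """(x - mean) / std with the sample std (n - 1)."""
    mean = statistics.fmean(values)
    std = statistics.stdev(values)
    return [(v - mean) / std for v in values]


# Hypothetical 5-step episode that ends with the pole falling (terminated).
rewards = [0.0, 0.0, 0.0, 0.0, -1.0]
raw = discounted_returns(rewards)
weights = standardize(raw)

for t, (q, w) in enumerate(zip(raw, weights)):
    print(f"t={t}  return={q:+.4f}  standardized={w:+.4f}")
```

Before standardizing, every return is negative. Afterwards the weights are centred on zero: the earliest steps get positive weights and the steps just before the fall get negative ones, so in the loss the log-probabilities of early actions are pushed up and those of the final actions are pushed down.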
Upvotes: 0
Views: 55
Reputation: 4081
The only thing clearly jumping out at me right now is VPG.forward.
You're doing a softmax over dim=0, but that would usually be the batch dimension. You want to take the softmax over the action space to determine which action to take (probabilistically, or with an eps-greedy strategy, etc.). So instead, try changing to dim=-1 like this:
class VPG(nn.Module):
    def __init__(self, input_size, output_size):
        super(VPG, self).__init__()
        self.net = nn.Sequential(
            XavierLinear(input_size, 128),
            nn.ReLU(),
            XavierLinear(128, output_size),
        )

    def forward(self, x):
        return F.softmax(self.net(x), dim=-1)  # softmax over action space
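To illustrate the difference between the two axes, here is a plain-Python sketch that mimics the two normalizations on a small (batch, actions) table of logits. The helper functions are hypothetical stand-ins written for this example, not PyTorch calls, and the logits are made up. For a single unbatched 1-D observation the two axes are the same, so the difference only shows up once inputs are batched.

```python
import math


def softmax(xs):
    """Numerically stable softmax of a flat list."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_last_dim(batch):
    """Like dim=-1 on a (batch, actions) tensor: normalize each row."""
    return [softmax(row) for row in batch]


def softmax_first_dim(batch):
    """Like dim=0 on a (batch, actions) tensor: normalize each column."""
    columns = [softmax(list(col)) for col in zip(*batch)]
    return [list(row) for row in zip(*columns)]


# 3 hypothetical states, 2 actions each.
logits = [[2.0, 0.0], [0.0, 0.0], [1.0, 3.0]]

rows = softmax_last_dim(logits)
cols = softmax_first_dim(logits)

print([round(sum(r), 6) for r in rows])  # per-state sums: each is 1
print([round(sum(r), 6) for r in cols])  # per-state sums: generally not 1
```

Only the row-wise version gives a valid probability distribution over actions for each state; the column-wise version mixes different states together.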
You're also resetting the environment twice, which doesn't need to happen, but that shouldn't cause the effect you're seeing.
Upvotes: 0