Reputation: 1202
I have tried to implement Proximal Policy Optimization with Intrinsic Curiosity Rewards for a stateful LSTM neural network.
The losses in both PPO and ICM are diverging, and I would like to find out whether it is a bug in the code or badly selected hyperparameters.
I used https://github.com/adik993/ppo-pytorch as the base code and reworked it to run on my environment and use an LSTM.
I can provide more code samples later if specifically requested, since there is a large amount of code.
def __init_curiosity(self):
    # Build the ICM module and move it to the training device.
    curiosity_factory = ICM.factory(MlpICMModel.factory(), policy_weight=1,
                                    reward_scale=0.1, weight=0.2,
                                    intrinsic_reward_integration=0.01,
                                    reporter=self.reporter)
    self.curiosity = curiosity_factory.create(self.state_converter,
                                              self.action_converter)
    self.curiosity.to(self.device, torch.float32)
    self.reward_normalizer = StandardNormalizer()
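For clarity, this is roughly how I understand the curiosity hyperparameters being used (a simplified sketch of the reward mixing; the function and argument names here are illustrative, not the repo's exact code):

import torch

def combined_reward(extrinsic: torch.Tensor,
                    intrinsic: torch.Tensor,
                    reward_scale: float = 0.1,
                    intrinsic_reward_integration: float = 0.01) -> torch.Tensor:
    # The raw intrinsic (curiosity) reward is scaled down first...
    scaled_intrinsic = reward_scale * intrinsic
    # ...and then blended with the extrinsic reward; with integration=0.01
    # the agent trains almost entirely on the extrinsic reward.
    return (1.0 - intrinsic_reward_integration) * extrinsic \
        + intrinsic_reward_integration * scaled_intrinsic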
def __init_PPO_trainer(self):
    # PPO trainer configuration (GAE used for both reward and advantage estimation).
    self.PPO_trainer = PPO(agent=self,
                           reward=GeneralizedRewardEstimation(gamma=0.99, lam=0.95),
                           advantage=GeneralizedAdvantageEstimation(gamma=0.99, lam=0.95),
                           learning_rate=1e-3,
                           clip_range=0.3,
                           v_clip_range=0.3,
                           c_entropy=1e-2,
                           c_value=0.5,
                           n_mini_batches=32,
                           n_optimization_epochs=10,
                           clip_grad_norm=0.5)
    self.PPO_trainer.to(self.device, torch.float32)
(Note the large values on the y axis of the loss plots.)
For now I have reworked the LSTM processing to use batches and hidden memory in all places (for both the main model and the ICM), but the problem is still present. I have traced it to the output of the ICM model: the divergence shows up mainly in the action_hat tensor.
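For reference, this is roughly where action_hat comes from as I understand the ICM forward pass (a simplified sketch; module names are placeholders, not the repo's exact classes):

import torch
import torch.nn as nn

class ICMSketch(nn.Module):
    def __init__(self, feature_net: nn.Module, inverse_net: nn.Module, forward_net: nn.Module):
        super().__init__()
        self.feature_net = feature_net    # encodes observations into features
        self.inverse_net = inverse_net    # predicts the action from (phi_t, phi_t1)
        self.forward_net = forward_net    # predicts phi_t1 from (phi_t, action)

    def forward(self, state, next_state, action):
        # action is assumed to be one-hot encoded here
        phi_t = self.feature_net(state)
        phi_t1 = self.feature_net(next_state)
        # action_hat: logits of the predicted action -- this is the tensor
        # whose values blow up in my runs
        action_hat = self.inverse_net(torch.cat([phi_t, phi_t1], dim=-1))
        phi_t1_hat = self.forward_net(torch.cat([phi_t, action], dim=-1))
        return action_hat, phi_t1_hat, phi_t1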
Upvotes: 0
Views: 896
Reputation: 1202
Found the problem... In the main model I use softmax for eval runs and log_softmax for training in the output layer, and according to the PyTorch docs CrossEntropyLoss applies log_softmax internally. As advised, I switched to NLLLoss, but I also used it for the ICM model loss, whose output layer has no softmax function at all! Switching back to CrossEntropyLoss (which was originally in the reference code) solved the ICM loss divergence.
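In other words (a minimal illustration; the tensor shapes are made up):

import torch
import torch.nn.functional as F

logits = torch.randn(8, 4)          # raw ICM inverse-model output, no softmax applied
target = torch.randint(0, 4, (8,))  # true actions

# Correct: cross_entropy applies log_softmax internally, so it expects raw logits.
loss_ok = F.cross_entropy(logits, target)

# My bug: nll_loss expects log-probabilities, so feeding it raw logits
# produces a mis-scaled loss that can grow without bound.
loss_bad = F.nll_loss(logits, target)

# nll_loss is only equivalent when log_softmax is applied first:
loss_equiv = F.nll_loss(F.log_softmax(logits, dim=-1), target)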
Upvotes: 1