Reputation: 1
I'm training with Proximal Policy Optimization (PPO) using the Stable-Baselines3 package (Reference 1 below), and I'm seeing the strange learning pattern shown below (screenshot 1: Learning Pattern).
My action space is multibinary, and to restrict this multibinary space to the values I need, I add a penalty to the reward whenever the chosen action falls outside my required domain.
What I'm experiencing are these strange dips in the rolling accumulated reward versus the number of episodes. I'm also noticing that learning does not improve after about 2000 episodes (shown in the zoomed part of my figure).
Does anybody know what could be the problem here?
I'm using the default neural network configuration from Reference 1: two hidden layers of 64 neurons each with tanh activations. My input has size 64 and my output is multibinary of size 21. All inputs to the network are normalized between 0 and 1, and my learning rate is 0.001. A rough sketch of my setup is included after the reference below. Please help. Best Regards,
Reference 1: https://github.com/DLR-RM/stable-baselines3
[Figure: Learning Pattern]
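In case it helps, here is a rough, simplified sketch of my setup. The environment dynamics, the domain check, the penalty value, and the episode length are only placeholders, not my real code:

```python
import numpy as np
import gym
from gym import spaces
from stable_baselines3 import PPO


class MyMultiBinaryEnv(gym.Env):
    """Placeholder environment: 64 normalized inputs, multibinary action of size 21."""

    def __init__(self):
        super().__init__()
        self.observation_space = spaces.Box(low=0.0, high=1.0, shape=(64,), dtype=np.float32)
        self.action_space = spaces.MultiBinary(21)
        self._steps = 0

    def reset(self):
        self._steps = 0
        return self.observation_space.sample()

    def step(self, action):
        self._steps += 1
        obs = self.observation_space.sample()
        # Penalize actions that fall outside the allowed domain
        # (the real domain check is problem-specific; this one is only a placeholder).
        in_domain = action.sum() <= 5
        reward = 1.0 if in_domain else -1.0
        done = self._steps >= 200  # fixed episode length, also a placeholder
        return obs, reward, done, {}


env = MyMultiBinaryEnv()

# Default MlpPolicy: two hidden layers of 64 units with tanh activations.
model = PPO("MlpPolicy", env, learning_rate=1e-3, verbose=1)
model.learn(total_timesteps=100_000)
```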
Upvotes: 0
Views: 472
Reputation: 53
You can try lowering the clip range, for example to 0.1 (see the sketch below). This restricts the policy update even more, which could resolve the instability you observed.
As for why the learning does not improve further, that depends on the specific task. It may already have reached the optimal policy.
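A minimal sketch of what I mean (the environment here is only a stand-in; use your own MultiBinary environment, and keep your other hyperparameters as they are):

```python
import gym
from stable_baselines3 import PPO

env = gym.make("CartPole-v1")  # stand-in; replace with your MultiBinary environment

# Default clip_range is 0.2; 0.1 makes each policy update more conservative.
model = PPO("MlpPolicy", env, clip_range=0.1, learning_rate=1e-3, verbose=1)
model.learn(total_timesteps=100_000)
```

Note that `clip_range` also accepts a function of the remaining training progress instead of a constant, in case you want the clipping to tighten as training goes on.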
Upvotes: 0